Effective prompt management is a cornerstone of building reliable and adaptable LLM applications. As discussed earlier in this chapter, treating prompts as critical configuration artifacts, subject to versioning, testing, and controlled rollout, is essential for operational maturity. Simply changing a prompt string directly in production code leads to unpredictable behavior and makes systematic improvement nearly impossible.
This practice section provides a hands-on example of building a basic workflow for managing and evaluating prompts. We'll focus on creating a repeatable process to compare different prompt versions against a predefined evaluation set, forming the foundation for more sophisticated prompt engineering operations.
Imagine we have an application that uses an LLM to summarize articles. We want to experiment with different prompts to see which one produces summaries that are concise (e.g., under a certain word count) and capture the main points effectively (which we'll approximate with a simple check here).
Our goal is to create a workflow that lets us store each prompt version as its own file, run every version against the same small evaluation set, compute simple metrics for each run, and compare the versions side by side.
For this practice, we'll assume you have access to an LLM endpoint. We will use a placeholder function call_llm_api to represent this interaction.
A simple and effective way to manage prompts, especially during development and for smaller teams, is to use a Git repository. Let's structure our workspace like this:
prompt_workspace/
├── prompts/
│   └── summarization/
│       ├── v1_concise.txt
│       └── v2_detailed_points.txt
├── evaluation_data/
│   └── articles.json
└── run_evaluation.py
Create the prompts/summarization directory and add two example prompt files:
prompts/summarization/v1_concise.txt:
Summarize the following article in 50 words or less:
{article_text}
prompts/summarization/v2_detailed_points.txt:
Provide a brief summary of the key points from the following article. Aim for clarity and conciseness.
Article:
{article_text}
Summary:
Notice the use of {article_text} as a placeholder. The evaluation script will fill it with basic string formatting (Python's str.format); a templating engine could be swapped in later for more complex prompts.
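If you want to see the templating in action before wiring up the full script, the placeholder can be filled directly with str.format. A quick check you can run from the prompt_workspace directory (the example article string is just illustrative):

# Load a prompt template and fill its {article_text} placeholder with str.format.
with open("prompts/summarization/v1_concise.txt", encoding="utf-8") as f:
    template = f.read()

filled_prompt = template.format(article_text="A short example article body goes here.")
print(filled_prompt)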
Create an evaluation_data directory and a file named articles.json with a few sample articles:
evaluation_data/articles.json:
[
  {
    "id": "article_001",
    "text": "Researchers have developed a new AI technique capable of generating realistic images from text descriptions. The model, called Imagen, shows significant improvements over previous methods, particularly in rendering complex scenes and relationships between objects. This advancement could impact fields ranging from graphic design to virtual reality."
  },
  {
    "id": "article_002",
    "text": "Global supply chains continue to face disruptions due to a combination of factors, including geopolitical tensions, lingering pandemic effects, and increased demand for certain goods. Experts predict these challenges may persist, leading to higher prices and longer wait times for consumers. Companies are exploring strategies like regionalization and diversification to mitigate risks."
  }
]
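Because the evaluation script assumes each record carries an id and a text field, a quick optional sanity check can catch malformed entries early. A minimal sketch, separate from the main script:

import json

# Verify every evaluation record has the fields run_evaluation.py expects.
with open("evaluation_data/articles.json", encoding="utf-8") as f:
    articles = json.load(f)

for i, item in enumerate(articles):
    missing = {"id", "text"} - item.keys()
    if missing:
        raise ValueError(f"Record {i} is missing fields: {missing}")

print(f"{len(articles)} articles look well-formed.")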
Now, let's create the run_evaluation.py script. This script will orchestrate the process.
import os
import json
import glob
from collections import defaultdict

# --- Placeholder LLM Interaction ---
# In a real scenario, this would involve API calls to an LLM endpoint
# (e.g., using libraries like openai, huggingface_hub, anthropic)
# It needs to handle authentication, potential errors, and retries.
def call_llm_api(prompt):
    """
    Placeholder function to simulate calling an LLM API.
    Replace with your actual LLM API call.
    """
    print(f"--- Calling LLM with prompt snippet: {prompt[:100]}... ---")

    # Simulate different outputs based on prompt hints for demonstration
    if "50 words or less" in prompt:
        simulated_summary = "New AI generates images from text. Imagen model improves realism for complex scenes, impacting design and VR."
    elif "key points" in prompt:
        simulated_summary = "AI Breakthrough: Imagen generates realistic images from text descriptions, outperforming prior models.\nSupply Chain Issues: Global disruptions persist due to various factors, potentially raising prices. Companies seek mitigation strategies."
    else:
        simulated_summary = "This is a generic summary output."

    # Simulate word count
    word_count = len(simulated_summary.split())
    print(f"--- Simulated LLM Response (Words: {word_count}): {simulated_summary} ---")
    return simulated_summary, word_count

# --- Evaluation Logic ---
def evaluate_summary(summary, word_count, max_words=50):
    """
    Simple evaluation: check if the summary meets the word count criterion.
    More sophisticated evaluations could involve:
    - Checking for specific keywords
    - Using another LLM for quality assessment (LLM-as-judge)
    - Semantic similarity to the original text (requires embeddings)
    - Human review scoring
    """
    meets_criteria = word_count <= max_words
    return {"word_count": word_count, "meets_criteria": meets_criteria}

# --- Workflow Orchestration ---
def load_prompts(prompt_dir):
    """Loads prompts from text files in a directory."""
    prompts = {}
    pattern = os.path.join(prompt_dir, "*.txt")
    for filepath in glob.glob(pattern):
        prompt_name = os.path.basename(filepath).replace(".txt", "")
        with open(filepath, 'r', encoding='utf-8') as f:
            prompts[prompt_name] = f.read()
    return prompts

def load_evaluation_data(data_path):
    """Loads evaluation data from a JSON file."""
    with open(data_path, 'r', encoding='utf-8') as f:
        data = json.load(f)
    return data

def run_evaluation(prompts, eval_data):
    """Runs prompts against evaluation data and collects results."""
    results = defaultdict(list)
    for prompt_name, prompt_template in prompts.items():
        print(f"\n===== Evaluating Prompt: {prompt_name} =====")
        prompt_results = []
        for item in eval_data:
            article_id = item['id']
            article_text = item['text']

            # Simple templating using str.format replacement
            try:
                filled_prompt = prompt_template.format(article_text=article_text)
            except KeyError:
                print(f"Warning: Prompt '{prompt_name}' missing placeholder 'article_text'. Skipping.")
                continue

            # Call the LLM (using placeholder)
            summary, word_count = call_llm_api(filled_prompt)

            # Evaluate the result (simple length check)
            evaluation_metrics = evaluate_summary(summary, word_count, max_words=50)

            prompt_results.append({
                "article_id": article_id,
                "summary": summary,
                "metrics": evaluation_metrics
            })
        results[prompt_name] = prompt_results
    return results

def display_results(results):
    """Displays a summary of the evaluation results."""
    print("\n===== Evaluation Summary =====")
    for prompt_name, prompt_results in results.items():
        total_items = len(prompt_results)
        if total_items == 0:
            print(f"\n--- Prompt: {prompt_name} ---")
            print("  No results generated (check prompt placeholders?).")
            continue

        met_criteria_count = sum(1 for r in prompt_results if r['metrics']['meets_criteria'])
        avg_word_count = sum(r['metrics']['word_count'] for r in prompt_results) / total_items

        print(f"\n--- Prompt: {prompt_name} ---")
        print(f"  Evaluated on: {total_items} articles")
        print(f"  Met Word Count Criteria (<=50 words): {met_criteria_count}/{total_items} ({met_criteria_count/total_items:.1%})")
        print(f"  Average Word Count: {avg_word_count:.1f}")

# --- Main Execution ---
if __name__ == "__main__":
    PROMPT_DIR = "prompts/summarization"
    EVAL_DATA_PATH = "evaluation_data/articles.json"

    # 1. Load prompts
    prompts = load_prompts(PROMPT_DIR)
    if not prompts:
        print(f"Error: No prompts found in {PROMPT_DIR}")
        exit()
    print(f"Loaded prompts: {list(prompts.keys())}")

    # 2. Load evaluation data
    eval_data = load_evaluation_data(EVAL_DATA_PATH)
    print(f"Loaded {len(eval_data)} evaluation articles.")

    # 3. Run evaluation
    evaluation_results = run_evaluation(prompts, eval_data)

    # 4. Display results
    display_results(evaluation_results)

    # Potential next step: Save detailed results to CSV or JSON
    # with open("evaluation_results.json", "w") as f:
    #     json.dump(evaluation_results, f, indent=2)
    # print("\nDetailed results saved to evaluation_results.json")
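The commented-out lines at the end show one way to persist results as JSON. If you prefer a flat file you can open in a spreadsheet, a small helper along these lines could be added to the script (a hypothetical sketch, not called anywhere above):

import csv

def save_results_csv(results, path="evaluation_results.csv"):
    """Flatten the nested results dict into one CSV row per (prompt, article) pair."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["prompt_name", "article_id", "word_count", "meets_criteria", "summary"])
        for prompt_name, prompt_results in results.items():
            for r in prompt_results:
                writer.writerow([
                    prompt_name,
                    r["article_id"],
                    r["metrics"]["word_count"],
                    r["metrics"]["meets_criteria"],
                    r["summary"],
                ])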
Navigate to the prompt_workspace directory in your terminal and run the script:
python run_evaluation.py
You should see output similar to this (details depend on the placeholder call_llm_api logic):
Loaded prompts: ['v1_concise', 'v2_detailed_points']
Loaded 2 evaluation articles.
===== Evaluating Prompt: v1_concise =====
--- Calling LLM with prompt snippet: Summarize the following article in 50 words or less:
Researchers have developed a new AI technique c... ---
--- Simulated LLM Response (Words: 17): New AI generates images from text. Imagen model improves realism for complex scenes, impacting design and VR. ---
--- Calling LLM with prompt snippet: Summarize the following article in 50 words or less:
Global supply chains continue to face disrup... ---
--- Simulated LLM Response (Words: 17): New AI generates images from text. Imagen model improves realism for complex scenes, impacting design and VR. ---

===== Evaluating Prompt: v2_detailed_points =====
--- Calling LLM with prompt snippet: Provide a brief summary of the key points from the following article. Aim for clarity and conciseness.... ---
--- Simulated LLM Response (Words: 29): AI Breakthrough: Imagen generates realistic images from text descriptions, outperforming prior models.
Supply Chain Issues: Global disruptions persist due to various factors, potentially raising prices. Companies seek mitigation strategies. ---
--- Calling LLM with prompt snippet: Provide a brief summary of the key points from the following article. Aim for clarity and conciseness.... ---
--- Simulated LLM Response (Words: 29): AI Breakthrough: Imagen generates realistic images from text descriptions, outperforming prior models.
Supply Chain Issues: Global disruptions persist due to various factors, potentially raising prices. Companies seek mitigation strategies. ---

===== Evaluation Summary =====

--- Prompt: v1_concise ---
  Evaluated on: 2 articles
  Met Word Count Criteria (<=50 words): 2/2 (100.0%)
  Average Word Count: 17.0

--- Prompt: v2_detailed_points ---
  Evaluated on: 2 articles
  Met Word Count Criteria (<=50 words): 2/2 (100.0%)
  Average Word Count: 29.0
Based on these (simulated) results, both prompts meet our simple length criterion, but v1_concise produces much shorter summaries on average, while v2_detailed_points produces longer, potentially more detailed ones (though our simple metric doesn't capture quality). This quantitative comparison helps guide which prompt might be better suited for the application's specific needs.
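Because the length check says nothing about content, a natural next metric is the keyword check mentioned in the evaluate_summary docstring. Here is a rough sketch, assuming you add an expected_keywords list to each record in articles.json (a hypothetical field, not present in the file above):

def keyword_coverage(summary, expected_keywords):
    """Return the fraction of expected keywords that appear in the summary."""
    if not expected_keywords:
        return None  # nothing to check against
    summary_lower = summary.lower()
    hits = sum(1 for kw in expected_keywords if kw.lower() in summary_lower)
    return hits / len(expected_keywords)

# Example: keyword_coverage(summary, ["Imagen", "images", "text"]) returns 1.0 if all three appear.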
This script represents a single evaluation run. To integrate it into a larger LLMOps workflow, have your CI system run it automatically whenever files in the prompts/ directory are updated (e.g., on push to a specific branch or via pull requests), persist the detailed results as a build artifact, and use the metrics to decide whether a new prompt version should be promoted; a minimal gating sketch follows the diagram below.

(Diagram: the automated prompt evaluation and promotion workflow integrated with CI/CD.)
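One way to express that gate in the CI job is a small pass/fail script that reads the saved metrics and exits non-zero when a prompt version falls below a threshold. A sketch, assuming the evaluation run has first written evaluation_results.json and using an illustrative 90% threshold:

import json
import sys

THRESHOLD = 0.9  # require 90% of articles to meet the word-count criterion

with open("evaluation_results.json", encoding="utf-8") as f:
    results = json.load(f)

failing_prompts = []
for prompt_name, prompt_results in results.items():
    if not prompt_results:
        failing_prompts.append(prompt_name)
        continue
    pass_rate = sum(r["metrics"]["meets_criteria"] for r in prompt_results) / len(prompt_results)
    if pass_rate < THRESHOLD:
        failing_prompts.append(prompt_name)

if failing_prompts:
    print(f"Prompt versions below threshold: {failing_prompts}")
    sys.exit(1)  # non-zero exit fails the CI job and blocks promotion

print("All prompt versions passed the evaluation gate.")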
This example provides a fundamental structure that you can extend significantly: save detailed results to JSON or CSV, add richer metrics such as keyword checks, semantic similarity, LLM-as-judge scoring, or human review (as hinted in the evaluate_summary docstring), and replace the placeholder with a real LLM client. A sketch of that last step follows.
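For instance, call_llm_api could be swapped for a real client. A minimal sketch using the openai Python SDK; the model name is an assumption, authentication is read from the environment, and any other provider's client could be substituted:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def call_llm_api(prompt):
    """Send the filled prompt to a chat model and return (summary, word_count)."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name; use whatever you have access to
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
    )
    summary = response.choices[0].message.content.strip()
    return summary, len(summary.split())

As the placeholder's comments note, a production version should also handle errors, retries, and rate limits.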
By establishing even a basic prompt management workflow like this, you move towards a more systematic, data-driven approach to developing and operating applications powered by large language models.