Effective prompt management is a foundation for building reliable and adaptable LLM applications. As discussed earlier in this chapter, treating prompts as critical configuration artifacts, subject to versioning, testing, and controlled rollout, is essential for operational maturity. Simply changing a prompt string directly in production code leads to unpredictable behavior and makes systematic improvement nearly impossible.

This practice section provides a hands-on example of building a basic workflow for managing and evaluating prompts. We'll focus on creating a repeatable process to compare different prompt versions against a predefined evaluation set, forming the foundation for more sophisticated prompt engineering operations.

### Scenario: Improving Text Summarization Prompts

Imagine we have an application that uses an LLM to summarize articles. We want to experiment with different prompts to see which one produces summaries that are concise (e.g., under a certain word count) and capture the main points effectively (which we'll approximate with a simple check here).

Our goal is to create a workflow that allows us to:

1. Store different versions of summarization prompts.
2. Define a small evaluation dataset of articles.
3. Run each prompt against the evaluation dataset using an LLM.
4. Evaluate the generated summaries based on length.
5. Compare the performance of different prompts.

For this practice, we'll assume you have access to an LLM endpoint. We will use a placeholder function `call_llm_api` to represent this interaction.

### Setting up the Prompt Repository

A simple and effective way to manage prompts, especially during development and for smaller teams, is using a Git repository. Let's structure our prompts:

```
prompt_workspace/
├── prompts/
│   └── summarization/
│       ├── v1_concise.txt
│       └── v2_detailed_points.txt
├── evaluation_data/
│   └── articles.json
└── run_evaluation.py
```

Create the `prompts/summarization` directory and add two example prompt files:

`prompts/summarization/v1_concise.txt`:

```
Summarize the following article in 50 words or less: {article_text}
```

`prompts/summarization/v2_detailed_points.txt`:

```
Provide a brief summary of the important points from the following article. Aim for clarity and conciseness.
Article: {article_text}
Summary:
```

Notice the use of `{article_text}` as a placeholder. We'll fill it with basic string formatting here; a templating engine is an option discussed under Next Steps.
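As a quick illustration before we wire this into a script, filling the `v1_concise` template with Python's built-in `str.format` might look like the following sketch (the file path and sample article text are just the examples from this section):

```python
# Minimal sketch: fill the {article_text} placeholder in a prompt template.
# Assumes the prompt_workspace layout shown above; the path is illustrative.
with open("prompts/summarization/v1_concise.txt", "r", encoding="utf-8") as f:
    template = f.read()

article = "Researchers have developed a new AI technique capable of generating realistic images from text descriptions."
filled_prompt = template.format(article_text=article)
print(filled_prompt)
# -> Summarize the following article in 50 words or less: Researchers have developed ...
```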
### Preparing Evaluation Data

Create an `evaluation_data` directory and a file named `articles.json` with a few sample articles:

`evaluation_data/articles.json`:

```json
[
  {
    "id": "article_001",
    "text": "Researchers have developed a new AI technique capable of generating realistic images from text descriptions. The model, called Imagen, shows significant improvements over previous methods, particularly in rendering complex scenes and relationships between objects. This advancement could impact fields ranging from graphic design to virtual reality."
  },
  {
    "id": "article_002",
    "text": "Global supply chains continue to face disruptions due to a combination of factors, including geopolitical tensions, lingering pandemic effects, and increased demand for certain goods. Experts predict these challenges may persist, leading to higher prices and longer wait times for consumers. Companies are exploring strategies like regionalization and diversification to mitigate risks."
  }
]
```

### Building the Evaluation Workflow Script

Now, let's create the `run_evaluation.py` script. This script will orchestrate the process.

```python
import os
import json
import glob
from collections import defaultdict

# --- Placeholder LLM Interaction ---
# In a real scenario, this would involve API calls to an LLM endpoint
# (e.g., using libraries like openai, huggingface_hub, anthropic).
# It needs to handle authentication, potential errors, and retries.
def call_llm_api(prompt):
    """
    Placeholder function to simulate calling an LLM API.
    Replace with your actual LLM API call.
    """
    print(f"--- Calling LLM with prompt snippet: {prompt[:100]}... ---")
    # Simulate different outputs based on prompt hints for demonstration
    if "50 words or less" in prompt:
        simulated_summary = "New AI generates images from text. Imagen model improves realism for complex scenes, impacting design and VR."
    elif "important points" in prompt:
        simulated_summary = "AI Breakthrough: Imagen generates realistic images from text descriptions, outperforming prior models.\nSupply Chain Issues: Global disruptions persist due to various factors, potentially raising prices. Companies seek mitigation strategies."
    else:
        simulated_summary = "This is a generic summary output."

    # Simulate word count
    word_count = len(simulated_summary.split())
    print(f"--- Simulated LLM Response (Words: {word_count}): {simulated_summary} ---")
    return simulated_summary, word_count

# --- Evaluation Logic ---
def evaluate_summary(summary, word_count, max_words=50):
    """
    Simple evaluation: check if summary meets the word count criteria.
    More sophisticated evaluations could involve:
    - Checking for specific keywords
    - Using another LLM for quality assessment (LLM-as-judge)
    - Semantic similarity to original text (requires embeddings)
    - Human review scoring
    """
    meets_criteria = word_count <= max_words
    return {"word_count": word_count, "meets_criteria": meets_criteria}
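
# --- Optional richer metric (illustrative sketch, not used below) ---
# One of the "more sophisticated evaluations" listed in the docstring above is a
# keyword check. This hypothetical helper assumes each evaluation item would carry
# an "expected_keywords" list, which articles.json above does not define.
def evaluate_keyword_coverage(summary, expected_keywords):
    """Return the fraction of expected keywords found in the summary (case-insensitive)."""
    if not expected_keywords:
        return None
    summary_lower = summary.lower()
    hits = sum(1 for kw in expected_keywords if kw.lower() in summary_lower)
    return hits / len(expected_keywords)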

# --- Workflow Orchestration ---
def load_prompts(prompt_dir):
    """Loads prompts from text files in a directory."""
    prompts = {}
    pattern = os.path.join(prompt_dir, "*.txt")
    for filepath in glob.glob(pattern):
        prompt_name = os.path.basename(filepath).replace(".txt", "")
        with open(filepath, 'r', encoding='utf-8') as f:
            prompts[prompt_name] = f.read()
    return prompts

def load_evaluation_data(data_path):
    """Loads evaluation data from a JSON file."""
    with open(data_path, 'r', encoding='utf-8') as f:
        data = json.load(f)
    return data

def run_evaluation(prompts, eval_data):
    """Runs prompts against evaluation data and collects results."""
    results = defaultdict(list)
    for prompt_name, prompt_template in prompts.items():
        print(f"\n===== Evaluating Prompt: {prompt_name} =====")
        prompt_results = []
        for item in eval_data:
            article_id = item['id']
            article_text = item['text']

            # Simple templating using str.format replacement
            try:
                filled_prompt = prompt_template.format(article_text=article_text)
            except KeyError:
                print(f"Warning: Prompt '{prompt_name}' references an unknown placeholder (expected only 'article_text'). Skipping.")
                continue

            # Call the LLM (using placeholder)
            summary, word_count = call_llm_api(filled_prompt)

            # Evaluate the result (simple length check)
            evaluation_metrics = evaluate_summary(summary, word_count, max_words=50)

            prompt_results.append({
                "article_id": article_id,
                "summary": summary,
                "metrics": evaluation_metrics
            })
        results[prompt_name] = prompt_results
    return results

def display_results(results):
    """Displays a summary of the evaluation results."""
    print("\n===== Evaluation Summary =====")
    for prompt_name, prompt_results in results.items():
        total_items = len(prompt_results)
        if total_items == 0:
            print(f"\n--- Prompt: {prompt_name} ---")
            print("No results generated (check prompt placeholders?).")
            continue

        met_criteria_count = sum(1 for r in prompt_results if r['metrics']['meets_criteria'])
        avg_word_count = sum(r['metrics']['word_count'] for r in prompt_results) / total_items

        print(f"\n--- Prompt: {prompt_name} ---")
        print(f"  Evaluated on: {total_items} articles")
        print(f"  Met Word Count Criteria (<=50 words): {met_criteria_count}/{total_items} ({met_criteria_count/total_items:.1%})")
        print(f"  Average Word Count: {avg_word_count:.1f}")

# --- Main Execution ---
if __name__ == "__main__":
    PROMPT_DIR = "prompts/summarization"
    EVAL_DATA_PATH = "evaluation_data/articles.json"

    # 1. Load prompts
    prompts = load_prompts(PROMPT_DIR)
    if not prompts:
        print(f"Error: No prompts found in {PROMPT_DIR}")
        exit()
    print(f"Loaded prompts: {list(prompts.keys())}")

    # 2. Load evaluation data
    eval_data = load_evaluation_data(EVAL_DATA_PATH)
    print(f"Loaded {len(eval_data)} evaluation articles.")

    # 3. Run evaluation
    evaluation_results = run_evaluation(prompts, eval_data)

    # 4. Display results
    display_results(evaluation_results)

    # Potential next step: Save detailed results to CSV or JSON
    # with open("evaluation_results.json", "w") as f:
    #     json.dump(evaluation_results, f, indent=2)
    # print("\nDetailed results saved to evaluation_results.json")
```
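The placeholder `call_llm_api` is the first thing to swap out for real use. As one possible drop-in replacement, here is a minimal sketch using the OpenAI Python SDK (v1+); it assumes an `OPENAI_API_KEY` environment variable, and the model name, temperature, and retry policy are illustrative choices rather than recommendations. Any provider client that returns text completions could be wired in the same way.

```python
import time
from openai import OpenAI  # assumes the openai SDK (v1+) is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def call_llm_api(prompt, model="gpt-4o-mini", max_retries=3):
    """Drop-in replacement for the placeholder: returns (summary, word_count)."""
    for attempt in range(1, max_retries + 1):
        try:
            response = client.chat.completions.create(
                model=model,  # illustrative model name
                messages=[{"role": "user", "content": prompt}],
                temperature=0.2,  # keep summaries fairly deterministic across runs
            )
            summary = response.choices[0].message.content.strip()
            return summary, len(summary.split())
        except Exception:  # in practice, catch the SDK's specific error classes
            if attempt == max_retries:
                raise
            time.sleep(2 ** attempt)  # simple exponential backoff before retrying
```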
### Running the Evaluation

Navigate to the `prompt_workspace` directory in your terminal and run the script:

```
python run_evaluation.py
```

You should see output similar to this (details depend on the placeholder `call_llm_api` logic):

```
Loaded prompts: ['v1_concise', 'v2_detailed_points']
Loaded 2 evaluation articles.

===== Evaluating Prompt: v1_concise =====
--- Calling LLM with prompt snippet: Summarize the following article in 50 words or less: Researchers have developed a new AI technique c... ---
--- Simulated LLM Response (Words: 19): New AI generates images from text. Imagen model improves realism for complex scenes, impacting design and VR. ---
--- Calling LLM with prompt snippet: Summarize the following article in 50 words or less: Global supply chains continue to face disrup... ---
--- Simulated LLM Response (Words: 19): New AI generates images from text. Imagen model improves realism for complex scenes, impacting design and VR. ---

===== Evaluating Prompt: v2_detailed_points =====
--- Calling LLM with prompt snippet: Provide a brief summary of the important points from the following article. Aim for clarity and conciseness.... ---
--- Simulated LLM Response (Words: 43): AI Breakthrough: Imagen generates realistic images from text descriptions, outperforming prior models.
Supply Chain Issues: Global disruptions persist due to various factors, potentially raising prices. Companies seek mitigation strategies. ---
--- Calling LLM with prompt snippet: Provide a brief summary of the important points from the following article. Aim for clarity and conciseness.... ---
--- Simulated LLM Response (Words: 43): AI Breakthrough: Imagen generates realistic images from text descriptions, outperforming prior models.
Supply Chain Issues: Global disruptions persist due to various factors, potentially raising prices. Companies seek mitigation strategies. ---

===== Evaluation Summary =====

--- Prompt: v1_concise ---
  Evaluated on: 2 articles
  Met Word Count Criteria (<=50 words): 2/2 (100.0%)
  Average Word Count: 19.0

--- Prompt: v2_detailed_points ---
  Evaluated on: 2 articles
  Met Word Count Criteria (<=50 words): 2/2 (100.0%)
  Average Word Count: 43.0
```

Based on these (simulated) results, both prompts meet our simple length criterion, but `v1_concise` produces much shorter summaries on average, while `v2_detailed_points` is closer to the limit and potentially more detailed (though our simple metric doesn't capture quality). This quantitative comparison helps guide which prompt might be better suited for the application's specific needs.

### Integrating into MLOps Pipelines

This script represents a single evaluation run. To integrate it into a larger LLMOps workflow:

- **Triggering:** Automate execution using a CI/CD system (like Jenkins, GitHub Actions, GitLab CI). Trigger runs automatically when prompts in the `prompts/` directory are updated (e.g., on push to a specific branch or via pull requests).
- **Experiment Tracking:** Log results (prompts, metrics, sample outputs) to an experiment tracking platform (MLflow, Weights & Biases, Comet ML). This allows historical comparison and better visualization; a sketch of such logging follows the diagram below.
- **Prompt Promotion:** Based on evaluation results meeting certain thresholds, the CI/CD pipeline could automatically tag the corresponding Git commit for the prompt version as "production-ready" or update a configuration file used by the deployed application to point to the new preferred prompt.
- **Visualization:** A simple diagram illustrating this flow:

```dot
digraph G {
    rankdir=LR;
    node [shape=box, style=rounded, fontname="sans-serif", fillcolor="#e9ecef", style=filled];
    edge [fontname="sans-serif"];

    subgraph cluster_repo {
        label = "Git Repository";
        style=filled;
        color="#dee2e6";
        prompt_v1 [label="Prompt v1", fillcolor="#ffffff"];
        prompt_v2 [label="Prompt v2", fillcolor="#ffffff"];
        eval_data [label="Eval Data", fillcolor="#ffffff"];
        eval_script [label="Evaluation Script", fillcolor="#ffffff"];
    }

    subgraph cluster_cicd {
        label = "CI/CD Pipeline";
        style=filled;
        color="#dee2e6";
        trigger [label="Git Commit Trigger", shape=ellipse, fillcolor="#a5d8ff"];
        run_eval [label="Run Evaluation Script", fillcolor="#74c0fc"];
        log_results [label="Log Results\n(Experiment Tracking)", fillcolor="#74c0fc"];
        decision [label="Decision Gate\n(Metrics Threshold)", shape=diamond, fillcolor="#ffe066"];
        promote [label="Promote Prompt\n(Tagging/Config Update)", fillcolor="#69db7c"];
    }

    llm_api [label="LLM API", shape=cylinder, fillcolor="#bac8ff"];
    deployed_app [label="Deployed Application", fillcolor="#96f2d7"];

    prompt_v2 -> trigger [style=dashed, label="Update"];
    trigger -> run_eval;
    eval_script -> run_eval [style=dashed, label="Uses"];
    eval_data -> run_eval [style=dashed, label="Uses"];
    {prompt_v1, prompt_v2} -> run_eval [style=dashed, label="Uses"];
    run_eval -> llm_api [label="Calls"];
    llm_api -> run_eval [label="Returns Summaries"];
    run_eval -> log_results [label="Sends Metrics"];
    log_results -> decision;
    decision -> promote [label="Pass"];
    decision -> trigger [label="Fail/Iterate", style=dashed];
    promote -> deployed_app [label="Updates Config"];
}
```

A diagram representing the automated prompt evaluation and promotion workflow integrated with CI/CD.
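To make runs comparable over time, the `results` dictionary produced by `run_evaluation` can be pushed to an experiment tracker. Below is a minimal sketch using MLflow's Python API, assuming `mlflow` is installed and a tracking URI is configured; the experiment name and the `log_results_to_mlflow` helper are illustrative, not part of the script above.

```python
import mlflow  # assumes the mlflow package is installed and a tracking URI is configured

def log_results_to_mlflow(results, experiment_name="prompt-evaluation"):
    """Log one MLflow run per prompt version with its aggregate metrics and raw outputs."""
    mlflow.set_experiment(experiment_name)  # illustrative experiment name
    for prompt_name, prompt_results in results.items():
        if not prompt_results:
            continue
        with mlflow.start_run(run_name=prompt_name):
            total = len(prompt_results)
            met = sum(1 for r in prompt_results if r["metrics"]["meets_criteria"])
            avg_words = sum(r["metrics"]["word_count"] for r in prompt_results) / total
            mlflow.log_param("prompt_name", prompt_name)
            mlflow.log_metric("met_criteria_rate", met / total)
            mlflow.log_metric("avg_word_count", avg_words)
            # Store the raw summaries alongside the metrics for later inspection.
            mlflow.log_dict({"results": prompt_results}, "evaluation_results.json")
```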
### Next Steps

This example provides a fundamental structure. You can extend it significantly:

- **More Sophisticated Templating:** Use engines like Jinja2 for more complex logic within prompts (conditionals, loops).
- **Richer Evaluation:** Implement additional metrics beyond word count. Consider using embedding models for semantic similarity checks, specific keyword extraction, or even LLM-as-judge patterns for qualitative assessments.
- **Human-in-the-Loop:** Integrate steps for human review and annotation of outputs, especially for tasks where automated metrics fall short.
- **Prompt Chaining/Agents:** Extend the concept to manage sequences or graphs of prompts used in more complex agentic systems.
- **Scalability:** For very large evaluation datasets or frequent runs, optimize the LLM interaction (batching, asynchronous calls) and potentially distribute the evaluation workload.

By establishing even a basic prompt management workflow like this, you move towards a more systematic, data-driven approach to developing and operating applications powered by large language models.