Numerical scores like perplexity and ROUGE establish a baseline understanding of how accurately a model reproduces validation data. However, human users rarely format their requests exactly like the examples in a training dataset. Testing prompt generalization measures how effectively a fine-tuned model handles unfamiliar phrasing, varied structures, and instructions it has not explicitly seen during the training phase.
If you fine-tuned a model to extract action items from meeting notes using a strict template, it might perform perfectly when prompted with the exact string "Extract action items:". If the same model completely fails when asked "What do we need to do next based on these notes?", the fine-tuning process has created a brittle representation. The model memorized the template rather than learning the underlying task. Measuring this flexibility is an important component of evaluating language models.
To systematically test generalization, you can construct an evaluation set consisting of grouped variations of the same fundamental task. You start with the original prompt structure from your dataset and generate multiple paraphrased versions.
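For example, one group in such an evaluation set might look like the following. The task and prompt strings are illustrative, not taken from a specific dataset:

```python
# A hypothetical evaluation set: each group pairs the original training-style
# prompt with paraphrased variations of the same underlying task.
evaluation_groups = [
    {
        "task": "action_item_extraction",
        "original": "Extract action items:",
        "variations": [
            "What do we need to do next based on these notes?",
            "List the follow-up tasks mentioned in this meeting.",
            "Summarize the to-dos from the discussion below.",
        ],
    },
]
```

Storing the original prompt alongside its variations lets the evaluation loop treat the original's output as the reference for each group.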
Workflow for testing prompt generalization by passing varied prompt structures through the fine-tuned model and comparing output consistency.
There are three primary techniques to apply when building these variations:

1. **Lexical paraphrasing:** Replace words in the original prompt with synonyms while keeping the structure intact, for example changing "Extract action items:" to "Pull out the action items:".
2. **Structural variation:** Reorder or reformat the prompt, such as placing the instruction after the source text instead of before it, or converting an imperative into a question.
3. **Instruction reframing:** Express the task implicitly or conversationally, as in "What do we need to do next based on these notes?", without naming the task directly.
Evaluating the results of these tests requires a shift from exact-match metrics to semantic metrics. Standard metrics like ROUGE are sensitive to specific word choices and will penalize the model if the output formatting changes slightly. Instead, you can use an external embedding model, such as a pre-trained sentence transformer, to convert the generated text into dense vector representations.
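The embedding step can be sketched as follows. A real evaluation would use a pre-trained sentence transformer (for instance, via the `sentence-transformers` package); the toy byte-count embedding below is a self-contained stand-in so the example runs without model downloads:

```python
# Toy stand-in for a sentence-transformer encoder: maps text to a fixed-size
# dense vector. In practice, replace this with a real embedding model's
# encode() call -- this function exists only to keep the sketch runnable.
def embed(text: str, dims: int = 8) -> list[float]:
    vec = [0.0] * dims
    for byte in text.encode("utf-8"):
        vec[byte % dims] += 1.0
    return vec

original_vec = embed("1. Send the budget draft to finance.")
varied_vec = embed("1) Email the draft budget to the finance team.")
```

Whatever encoder you use, the key requirement is that semantically similar outputs land close together in the vector space, so that similarity between outputs can be measured numerically.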
Once you have the vector embeddings, you can calculate the cosine similarity between the output generated by the original prompt and the output generated by the varied prompt. If $\mathbf{u}$ represents the embedding vector of the original output and $\mathbf{v}$ represents the embedding vector of the varied output, the cosine similarity is calculated as:

$$\text{similarity}(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert}$$
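This calculation translates directly into code. A minimal NumPy implementation:

```python
import numpy as np

def cosine_similarity(u, v) -> float:
    """Cosine similarity between two embedding vectors:
    the dot product divided by the product of the vector norms."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
```

Identical directions score 1.0, orthogonal vectors score 0.0, so the value directly answers "how close in meaning are these two outputs?"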
A high cosine similarity score indicates that the semantic meaning of the outputs remains consistent regardless of how the prompt was phrased. The model is generalizing well. If the similarity score drops significantly when the prompt is varied, the model is overly sensitive to the input format.
You can automate this process by writing an evaluation loop in Python. Load your fine-tuned model and a separate sentence transformer model. Iterate through a JSON file containing lists of synonymous prompts. Generate a response for every prompt in the list, convert the responses to embeddings, and calculate the variance in similarity scores for each group.
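A minimal sketch of such a loop is shown below. The `generate` and `embed` functions are placeholders standing in for the fine-tuned model and the sentence transformer; in a real run you would replace their bodies with calls to your model and encoder, and load the prompt groups from your JSON file:

```python
import statistics

def generate(prompt: str) -> str:
    # Placeholder: in a real evaluation, call your fine-tuned model here.
    return "1. Send the budget draft to finance."

def embed(text: str, dims: int = 8) -> list[float]:
    # Placeholder: in a real evaluation, call a sentence transformer here.
    vec = [0.0] * dims
    for byte in text.encode("utf-8"):
        vec[byte % dims] += 1.0
    return vec

def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norms = (sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5)
    return dot / norms

def evaluate_group(prompts: list[str]) -> dict:
    """Score each varied prompt's output against the original prompt's output.
    prompts[0] is the original; the rest are paraphrased variations."""
    reference = embed(generate(prompts[0]))
    scores = [cosine(reference, embed(generate(p))) for p in prompts[1:]]
    return {"mean": statistics.mean(scores),
            "variance": statistics.pvariance(scores)}
```

A high mean with low variance across a group suggests the model treats the paraphrases as the same task; a low mean or high variance flags the brittleness described above.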
If your automated evaluation reveals poor generalization, this often points directly to issues in your dataset preparation or training parameters. A model that only responds to a single, specific prompt format has likely been overtrained on that exact string. Observing these failures naturally leads to the next phase of evaluation, where you must detect signs of overfitting and catastrophic forgetting by comparing these exact outputs against the capabilities of the original base model.