While intrinsic metrics like perplexity measure how well a model predicts the next token, they do not always reflect a model's practical usefulness. A model with strong language modeling scores can still fall short on the specific applications it is meant to support, so judging it by internal prediction quality alone is often insufficient.
This chapter shifts focus to extrinsic evaluation methods. We will examine how to assess Large Language Models (LLMs) by measuring their performance on concrete downstream Natural Language Processing (NLP) tasks. This approach provides a more grounded understanding of the model's capabilities in scenarios it might encounter after deployment.
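As a rough illustration of the two viewpoints, the sketch below contrasts an intrinsic score (perplexity computed from per-token negative log-likelihoods) with an extrinsic one (accuracy on a small labeled task). The token NLL values, the predict_label wrapper, and the tiny evaluation set are hypothetical placeholders, not tied to any particular model or library.

```python
import math

# Intrinsic view: perplexity is the exponential of the average
# per-token negative log-likelihood on held-out text.
# (Hypothetical per-token NLLs, in nats.)
token_nlls = [2.1, 1.7, 3.0, 0.9, 2.4]
perplexity = math.exp(sum(token_nlls) / len(token_nlls))

# Extrinsic view: score the same model on a concrete downstream task,
# e.g. accuracy on a small labeled sentiment-classification set.
def predict_label(text: str) -> str:
    # Hypothetical wrapper around the model; always answers "positive" here.
    return "positive"

eval_set = [("great movie", "positive"), ("dull plot", "negative")]
correct = sum(predict_label(text) == label for text, label in eval_set)
accuracy = correct / len(eval_set)

print(f"Perplexity (intrinsic):    {perplexity:.2f}")
print(f"Task accuracy (extrinsic): {accuracy:.2%}")
```

The point of the comparison is that the two numbers answer different questions: perplexity summarizes next-token prediction quality, while task accuracy measures usefulness on a specific application.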
You will learn about:

22.1 Rationale for Downstream Task Evaluation
22.2 Common Downstream NLP Tasks
22.3 Fine-tuning Procedures for Evaluation
22.4 Standard Benchmarks: GLUE and SuperGLUE
22.5 Few-Shot and Zero-Shot Evaluation
22.6 Developing Custom Evaluation Tasks

By the end of this chapter, you will understand how to measure an LLM's effectiveness through its application to specific, practical NLP problems, complementing the insights gained from intrinsic metrics.