Elements of Information Theory, Thomas M. Cover and Joy A. Thomas, 2006 (Wiley-Interscience) - A classic textbook providing the mathematical foundations of information theory, including entropy, cross-entropy, and the concept of bits as a measure of information content.
Speech and Language Processing, Daniel Jurafsky and James H. Martin, 2025 - An authoritative textbook on natural language processing, covering language modeling, perplexity, and various evaluation metrics in detail.
Exploring the Limits of Language Modeling, Rafał Jozefowicz, Oriol Vinyals, Samy Bengio, Mohammad Norouzi, 2016, Proceedings of the 33rd International Conference on Machine Learning, DOI: 10.5555/3045390.3045437 - This paper investigates the performance of large-scale language models and uses bits-per-character (BPC) as a key metric for evaluating and comparing models.
Neural Machine Translation of Rare Words with Subword Units, Rico Sennrich, Barry Haddow, Alexandra Birch, 2016, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (Association for Computational Linguistics), DOI: 10.18653/v1/P16-1162 - Introduces Byte Pair Encoding (BPE), a widely used subword tokenization method, illustrating the need for metrics like BPC to fairly compare models using different tokenization schemes.