Speech and Language Processing (3rd edition draft), Daniel Jurafsky and James H. Martin, 2025 - A widely used textbook that covers fundamental concepts in natural language processing, including text normalization, tokenization, and other preprocessing steps.
Unicode Normalization Forms, Ken Whistler, 2025, Unicode Technical Report #15 (Unicode Consortium) - The authoritative technical report from the Unicode Consortium that defines and explains the different Unicode normalization forms (NFC, NFD, NFKC, NFKD).
unicodedata - Unicode Database, Python Software Foundation, 2023 - Official documentation for Python's unicodedata module, which provides access to the Unicode character database and functions for text normalization.