N-gram models are a traditional approach to estimating the probability of word sequences. Despite their foundational role, these models face a significant limitation: they struggle with word sequences that never occurred in the training data. For example, if the phrase "wreck a nice" never appeared in the training text, an N-gram model would assign a probability of zero to the word "beach" following it. This problem, known as data sparsity, severely limits how well N-gram models generalize.
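To make the data sparsity problem concrete, the sketch below estimates trigram probabilities from raw counts over a tiny made-up corpus. The corpus and phrases are illustrative assumptions, not real training data; the point is simply that any continuation never seen after a given two-word context receives a probability of exactly zero.

```python
from collections import defaultdict

# Toy corpus (an illustrative assumption, not real training data).
corpus = "it is a nice day and it is a good day".split()

trigram_counts = defaultdict(int)
context_counts = defaultdict(int)
for w1, w2, w3 in zip(corpus, corpus[1:], corpus[2:]):
    trigram_counts[(w1, w2, w3)] += 1
    context_counts[(w1, w2)] += 1

def trigram_prob(w1, w2, w3):
    """Maximum-likelihood estimate of P(w3 | w1, w2); zero for unseen continuations."""
    if context_counts[(w1, w2)] == 0:
        return 0.0
    return trigram_counts[(w1, w2, w3)] / context_counts[(w1, w2)]

print(trigram_prob("a", "nice", "day"))    # 1.0 -- seen in the corpus
print(trigram_prob("a", "nice", "beach"))  # 0.0 -- never seen, so judged impossible
```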
This is where neural networks provide a more powerful and flexible approach to language modeling. Instead of simply counting word co-occurrences, a neural network learns to represent words based on their context and meaning.
The first step in a neural language model is to move away from treating words as simple, distinct text labels. Instead, each word is mapped to a dense list of numbers called a word embedding or a word vector.
The important property of these vectors is that words with similar meanings or that are used in similar contexts will have similar vectors. For example, the vectors for "nice", "good", and "lovely" would be mathematically close to each other in the vector space. This single change allows the model to generalize far better than an N-gram model. If the model has learned the phrase "a good day", it can infer that "a nice day" is also a likely phrase because the vectors for "good" and "nice" are similar.
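As a rough illustration of this property, the snippet below compares hand-made word vectors using cosine similarity. The four-dimensional values are invented for the example; real embeddings typically have hundreds of dimensions and are learned automatically from data.

```python
import numpy as np

# Hypothetical 4-dimensional embeddings (values invented for illustration).
embeddings = {
    "nice":   np.array([0.8, 0.1, 0.6, 0.2]),
    "good":   np.array([0.7, 0.2, 0.5, 0.3]),
    "lovely": np.array([0.8, 0.2, 0.6, 0.1]),
    "wreck":  np.array([-0.5, 0.9, -0.3, 0.7]),
}

def cosine_similarity(u, v):
    """Similarity of two word vectors: values near 1.0 mean very similar directions."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine_similarity(embeddings["nice"], embeddings["good"]))   # high: similar meaning
print(cosine_similarity(embeddings["nice"], embeddings["wreck"]))  # low: unrelated words
```

Because "nice" and "good" point in nearly the same direction, a model that has seen "a good day" treats "a nice day" as a plausible phrase as well.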
A comparison of the N-gram and neural network approaches. The N-gram model relies on direct lookups, while the neural network processes numerical representations of words to understand context.
Another limitation of N-gram models is their fixed, short-term memory. A trigram model, for instance, only ever considers the two preceding words. It has no information about words that appeared earlier in the sentence.
Neural network architectures designed for sequences, such as Recurrent Neural Networks (RNNs), address this. An RNN processes a sentence one word at a time and maintains an internal state, or "memory," that is updated with each new word. This state allows the model to retain information from the beginning of a sentence and use it to make better predictions later on. For example, in the sentence "My friends from Germany, who I haven't seen in years, are finally coming to visit. I can't wait to speak... ", an RNN is much more likely to predict the word "German" than an N-gram model because it can remember the context of "Germany" from much earlier.
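To show how this internal "memory" works mechanically, here is a minimal sketch of a single recurrent update, assuming a plain Elman-style RNN cell with randomly initialized weights; a trained model would learn these weights from data.

```python
import numpy as np

# Illustrative sizes and random weights (a real model learns these during training).
vocab_size, embed_dim, hidden_dim = 10, 8, 16
rng = np.random.default_rng(0)

W_embed = rng.normal(size=(vocab_size, embed_dim))   # word embedding table
W_xh = rng.normal(size=(embed_dim, hidden_dim))      # input-to-hidden weights
W_hh = rng.normal(size=(hidden_dim, hidden_dim))     # hidden-to-hidden weights
b_h = np.zeros(hidden_dim)

def rnn_step(word_id, h_prev):
    """One recurrent step: h_t = tanh(x_t W_xh + h_{t-1} W_hh + b)."""
    x = W_embed[word_id]
    return np.tanh(x @ W_xh + h_prev @ W_hh + b_h)

# Process a sentence one word at a time; the hidden state carries information
# from every earlier word, not just the previous two.
h = np.zeros(hidden_dim)
for word_id in [3, 7, 1, 4, 2]:   # token ids for an example sentence
    h = rnn_step(word_id, h)
print(h.shape)  # (16,) -- a running summary of the whole sequence so far
```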
In summary, neural network language models offer two main advantages over traditional N-gram models: they generalize to word sequences never seen in training, because words with similar meanings have similar vector representations, and they can draw on context from much earlier in a sentence instead of being restricted to a fixed window of preceding words.
Because of this superior performance, neural network-based language models are now the standard in virtually all modern speech recognition systems. While the details of models like LSTMs (Long Short-Term Memory networks) and Transformers are topics for a more advanced course, your understanding of N-grams provides the perfect background for appreciating why these more complex models are so effective.