As we've seen in the chapter introduction, AI systems require data to be in a numerical format to perform their tasks. While an image might be represented as a grid of pixel values, text, which is fundamentally a sequence of characters, also needs a similar transformation. This section will guide you through how raw text, composed of letters, punctuation, and spaces, is converted into numerical structures that AI models can understand and learn from. This process is the first step in enabling machines to derive meaning from written language.
The very first step in processing text is usually tokenization. Think of tokenization as breaking down a long sentence or paragraph into smaller, manageable pieces called "tokens." Most commonly, these tokens are words. For example, the sentence "AI is fun!" can be tokenized into:
["AI", "is", "fun", "!"]
Punctuation marks like "!" are often treated as separate tokens because they can carry meaning (e.g., indicating excitement or the end of a sentence). Sometimes, text is tokenized into sub-word units or even individual characters, depending on the specific requirements of the AI task. For beginners, word-level tokenization is a great starting point.
A simple illustration of tokenizing a sentence into words and punctuation.
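To make this concrete, here is a minimal word-level tokenization sketch in Python. The regular expression it uses is just one illustrative choice; real-world tokenizers handle many more edge cases (contractions, numbers, unicode, and so on).

```python
import re

def simple_tokenize(text):
    # \w+ matches runs of letters/digits (words);
    # [^\w\s] matches single punctuation characters as their own tokens.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("AI is fun!"))  # ['AI', 'is', 'fun', '!']
```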
Once we have a collection of tokens, the next step is to build a vocabulary. A vocabulary is essentially a list of all unique tokens found in our text data. Each unique token in this vocabulary is then assigned a unique numerical identifier (an integer ID). This is like creating a dictionary where each word (token) has a specific page number (ID).
For instance, if our entire dataset only contained the sentences "AI is fun" and "AI is great", our unique tokens would be "AI", "is", "fun", and "great". Our vocabulary could look like this:
{"AI": 0, "is": 1, "fun": 2, "great": 3}
So, the sentence "AI is fun" would be represented numerically as the sequence [0, 1, 2].
An example showing how tokens from a sentence are mapped to numerical IDs based on a defined vocabulary.
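The sketch below shows one straightforward way to build such a vocabulary and encode a sentence as IDs. The particular ID assignments simply follow the order in which tokens are first seen, matching the example above.

```python
# Tokenized sentences from our tiny example dataset.
sentences = [["AI", "is", "fun"], ["AI", "is", "great"]]

# Assign each unique token an integer ID in the order it first appears.
vocab = {}
for sentence in sentences:
    for token in sentence:
        if token not in vocab:
            vocab[token] = len(vocab)

print(vocab)  # {'AI': 0, 'is': 1, 'fun': 2, 'great': 3}

# Represent a sentence as its sequence of token IDs.
print([vocab[token] for token in ["AI", "is", "fun"]])  # [0, 1, 2]
```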
This numerical sequence is a basic way to represent text, but AI models often benefit from more structured vector representations.
While sequences of IDs are a start, we often convert these tokens or sequences of tokens into vectors. A vector is simply an array of numbers. There are several common methods to do this.
One-Hot Encoding (OHE) is a straightforward technique. For each token, we create a vector that is as long as our entire vocabulary. This vector is filled with zeros, except for a single '1' at the index corresponding to the token's ID in the vocabulary.
Imagine our vocabulary is {"the": 0, "cat": 1, "sat": 2}. The token "cat" (ID 1) would be encoded as [0, 1, 0], while "the" (ID 0) would be [1, 0, 0].
A token 'cat' represented as a one-hot encoded vector based on a small vocabulary. The '1' appears at the index assigned to 'cat'.
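A minimal sketch of one-hot encoding, assuming we already have the small vocabulary from the example above:

```python
vocab = {"the": 0, "cat": 1, "sat": 2}

def one_hot(token, vocab):
    # A vector of zeros as long as the vocabulary, with a single 1
    # at the index assigned to this token.
    vector = [0] * len(vocab)
    vector[vocab[token]] = 1
    return vector

print(one_hot("cat", vocab))  # [0, 1, 0]
print(one_hot("the", vocab))  # [1, 0, 0]
```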
While simple, OHE has drawbacks:
- The vectors are very long (one dimension per vocabulary word) and almost entirely zeros, which becomes wasteful for large vocabularies.
- Every pair of word vectors is equally far apart, so the encoding captures nothing about how similar two words are in meaning.
The Bag-of-Words (BoW) model is another popular way to represent an entire piece of text (like a sentence or document) as a single vector. This vector has one entry for each word in the vocabulary. Each entry can be:
- a count of how many times that word appears in the text, or
- a binary value: 1 if the word appears at all, 0 if it does not.
For example, with a vocabulary {"the", "cat", "sat", "on", "mat"}:
The sentence "the cat sat" becomes [1, 1, 1, 0, 0] (if using counts), while a sentence containing "the" and "cat" twice each, such as "the cat, the cat", becomes [2, 2, 0, 0, 0].
Bag-of-Words representation for two sentences using word counts. Each row is a vector for a sentence.
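Here is a minimal count-based Bag-of-Words sketch, assuming the text has already been tokenized into lowercase words:

```python
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}

def bag_of_words(tokens, vocab):
    # One count per vocabulary word; tokens outside the vocabulary are ignored.
    counts = [0] * len(vocab)
    for token in tokens:
        if token in vocab:
            counts[vocab[token]] += 1
    return counts

print(bag_of_words(["the", "cat", "sat"], vocab))         # [1, 1, 1, 0, 0]
print(bag_of_words(["the", "cat", "the", "cat"], vocab))  # [2, 2, 0, 0, 0]
```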
The main limitation of BoW is that it loses all information about word order. "The cat sat on the mat" and "The mat sat on the cat" contain exactly the same words, so they produce identical BoW representations, even though their meanings are different.
A common refinement of BoW is TF-IDF (Term Frequency-Inverse Document Frequency). TF-IDF doesn't just count words; it also considers how important a word is to a document within a larger collection of documents. It gives higher weights to words that are frequent in a specific document but rare across all documents. For a beginner's course, understanding that TF-IDF is a more sophisticated way to score words in the BoW model is sufficient.
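If you want to experiment, libraries such as scikit-learn provide a ready-made implementation. The sketch below assumes scikit-learn is installed; the exact weights it prints depend on the library's default smoothing and normalization settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

# Words appearing in both documents ("the", "sat", "on") receive a lower
# inverse document frequency than words unique to one document.
print(vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray().round(2))
```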
The methods discussed so far (OHE, BoW) are useful, but they don't inherently capture the meaning or semantic relationships between words. This is where word embeddings come in.
Word embeddings represent words as dense, low-dimensional vectors (unlike the sparse, high-dimensional vectors of OHE). The magic of word embeddings is that words with similar meanings tend to have similar vector representations. These vectors are typically learned from large amounts of text data, and the process aims to place words in a "vector space" such that their geometric relationships reflect their semantic relationships.
For instance, in an ideal embedding space, vector("king") - vector("man") + vector("woman") might result in a vector very close to vector("queen"). These vectors are "dense" because most of their values are non-zero, and "low-dimensional" typically means they have a few hundred dimensions, rather than tens of thousands like OHE.
A simplified 2D view where related words (like fruits or royalty) are closer to each other in the vector space. "Car" and "truck" form another cluster.
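To give a feel for how "closeness" is measured, here is a toy sketch using cosine similarity. The vectors are invented purely for illustration; real embeddings are learned from large amounts of text and have hundreds of dimensions.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, hand-made for illustration only.
embeddings = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.5, 0.9, 0.0]),
    "woman": np.array([0.5, 0.1, 0.7]),
}

def cosine_similarity(a, b):
    # 1.0 means the vectors point in the same direction; values near 0 mean unrelated.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# The classic analogy: king - man + woman should land near queen.
result = embeddings["king"] - embeddings["man"] + embeddings["woman"]
for word, vector in embeddings.items():
    print(word, round(float(cosine_similarity(result, vector)), 3))
```

With these made-up numbers, "queen" scores highest, which is the behavior real embedding spaces approximate.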
Word embeddings like Word2Vec, GloVe, and FastText are powerful because they provide a way for AI models to understand nuances in word meaning and context, which is a significant step from raw characters toward actual meaning. For now, it's important to know that they exist and represent a more advanced way to turn words into numbers that carry semantic information.
By transforming text into these various numerical formats, we lay the groundwork for AI models to process and "understand" language. Each representation has its strengths and weaknesses, and the choice often depends on the specific task at hand. These numerical representations are the fundamental building blocks for any AI system that works with text data.