While subword tokenization algorithms like BPE or WordPiece significantly reduce the problem of out-of-vocabulary (OOV) words and manage vocabulary size, they don't inherently provide structural information needed by many transformer-based models. To handle tasks like sequence classification, sentence pair comparison, or simply defining sequence boundaries, we introduce dedicated special tokens. These tokens are treated as distinct vocabulary items, usually reserved and added explicitly when training the tokenizer or configuring a pre-trained one.
Let's examine the most common special tokens and their roles:
[PAD]
Machine learning models typically process data in batches for efficiency. However, sequences within a batch often have different lengths. To form the rectangular tensors required by deep learning frameworks, shorter sequences need to be padded to match the length of the longest sequence in the batch. The [PAD] token serves this purpose.
Example: Consider tokenizing two sentences: "Hello world" and "Tokenization example".
If tokenized into IDs, they might look like [101, 7592, 2088, 102] and [101, 19204, 7404, 2742, 102]. To batch them with a maximum length of 6, assuming [PAD] has ID 0, the batch would be:
[[101, 7592, 2088, 102, 0, 0],
[101, 19204, 7404, 2742, 102, 0]]
It's important that the model ignores these padding tokens during computation, especially in the attention mechanism. This is achieved using an attention mask, a binary tensor indicating which tokens should be attended to (1) and which should be ignored (0). For the example above, the mask would be:
[[1, 1, 1, 1, 0, 0],
[1, 1, 1, 1, 1, 0]]
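The padding and masking behavior can be reproduced directly with a Hugging Face tokenizer. The sketch below is illustrative and assumes bert-base-uncased; the exact IDs depend on the vocabulary, but the padding and attention-mask pattern is the same.
from transformers import AutoTokenizer

# Illustrative sketch: bert-base-uncased is an assumption; any tokenizer with a pad token behaves similarly
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Tokenize two sentences of different lengths as one batch
batch = tokenizer(
    ["Hello world", "Tokenization example"],
    padding=True,           # pad to the longest sequence in the batch
    return_tensors='pt'
)

print(batch['input_ids'])       # the shorter sequence is filled with pad_token_id (0 for BERT)
print(batch['attention_mask'])  # 1 = attend to this token, 0 = padding, ignore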
[UNK]
Even with subword tokenization, there might be rare characters or character sequences (e.g., typos, unsupported symbols, truly novel words) that were not encountered during tokenizer training and cannot be broken down into known subwords. The [UNK] token represents these unknown entities. While subword methods aim to minimize the frequency of [UNK] tokens, having a fallback is necessary. A high frequency of [UNK] tokens during inference often indicates a mismatch between the tokenization vocabulary and the input data distribution, potentially degrading model performance.
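You can check how often a tokenizer falls back to [UNK] with a short sketch like the one below. It assumes bert-base-uncased; whether a particular symbol maps to [UNK] depends entirely on the tokenizer's vocabulary.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# A string containing a symbol that may not be covered by the vocabulary
text = "An unusual symbol: ☃"
print(tokenizer.tokenize(text))  # the symbol may come back as '[UNK]' if it is not in the vocabulary

# Count [UNK] tokens in the encoded input
ids = tokenizer.encode(text)
unk_count = sum(1 for token_id in ids if token_id == tokenizer.unk_token_id)
print(f"[UNK] count: {unk_count}")
Monitoring this count on a sample of real input data is a simple way to detect a vocabulary mismatch before it hurts model quality.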
[CLS]
Some transformer architectures, notably BERT, prefix every input sequence with a special [CLS] token. The final hidden state corresponding to this token is often used as an aggregate representation of the entire sequence. This representation can then be fed into a classification head for sequence-level tasks like sentiment analysis or topic classification. While other pooling strategies exist (e.g., mean pooling of token embeddings), using the [CLS] token's output is a common convention.
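The sketch below shows how the [CLS] representation is typically extracted. It assumes bert-base-uncased loaded through AutoModel, whose outputs expose last_hidden_state.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')

inputs = tokenizer("A sentence to classify.", return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

# [CLS] is always the first token, so its final hidden state sits at position 0
cls_embedding = outputs.last_hidden_state[:, 0, :]
print(cls_embedding.shape)  # torch.Size([1, 768]) for bert-base-uncased
A sequence classifier is then just a small linear layer applied to cls_embedding.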
[SEP]
To handle tasks involving multiple text segments (e.g., question answering, where the input is [Question, Context], or natural language inference with [Premise, Hypothesis]), a separator token [SEP] is used. It explicitly marks the boundary between distinct segments within a single input sequence fed to the model.
Example Input Format for BERT on a Sentence Pair Task:
[CLS] Sentence A tokens [SEP] Sentence B tokens [SEP]
Note that some models might use [SEP] only once, while others use it after each segment, including the last one.
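BERT-style tokenizers also emit token_type_ids (segment IDs) alongside the [SEP] markers, which tell the model which segment each token belongs to. A minimal sketch, again assuming bert-base-uncased:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

question = "What do separator tokens mark?"
context = "Separator tokens mark the boundary between segments."

encoded = tokenizer(question, context)
print(tokenizer.convert_ids_to_tokens(encoded['input_ids']))
# token_type_ids: 0 for the first segment ([CLS] ... [SEP]), 1 for the second (... [SEP])
print(encoded['token_type_ids'])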
[MASK]
This token is specific to the Masked Language Modeling (MLM) pre-training objective, popularized by BERT. During pre-training, a percentage of the input tokens is randomly replaced with the [MASK] token. The model's task is then to predict the original tokens that were masked, based on the surrounding context. This forces the model to learn rich bidirectional representations. The [MASK] token typically isn't used during fine-tuning or standard inference unless the task specifically involves predicting masked spans.
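To see the MLM objective in action at inference time, the fill-mask pipeline in Hugging Face transformers predicts the token behind [MASK]. A brief sketch, assuming bert-base-uncased:
from transformers import pipeline

fill_mask = pipeline('fill-mask', model='bert-base-uncased')

# The model ranks candidate tokens for the [MASK] position
for prediction in fill_mask("The capital of France is [MASK]."):
    print(f"{prediction['token_str']!r}  score={prediction['score']:.3f}")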
[BOS], [EOS]
Autoregressive models like GPT often use begin-of-sequence ([BOS] or <s>) and end-of-sequence ([EOS] or </s>) tokens. [BOS] marks the start of a sequence, while [EOS] indicates that the sequence is complete; during generation, producing [EOS] is the model's cue to stop. While similar to [CLS] and [SEP] in marking boundaries, their primary role is in the context of generating text one token at a time.
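How these boundary tokens are named and reused differs between models. The sketch below inspects GPT-2's tokenizer (the gpt2 checkpoint is assumed), which reuses a single <|endoftext|> token for both roles and defines no padding token by default.
from transformers import AutoTokenizer

gpt2_tokenizer = AutoTokenizer.from_pretrained('gpt2')

print(gpt2_tokenizer.bos_token, gpt2_tokenizer.bos_token_id)  # '<|endoftext|>' for GPT-2
print(gpt2_tokenizer.eos_token, gpt2_tokenizer.eos_token_id)  # the same token is reused
print(gpt2_tokenizer.pad_token)  # None by default; often set to the EOS token for batching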
Special tokens are added to the tokenizer's vocabulary and assigned unique IDs, just like regular subword tokens. Libraries like Hugging Face transformers manage this process seamlessly. When you load a pre-trained tokenizer, it comes configured with the special tokens used during its pre-training.
import torch
from transformers import AutoTokenizer
# Load a pre-trained tokenizer (e.g., BERT)
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
# Example sentences
sentence1 = "This is sentence one."
sentence2 = "This is another sentence."
# Tokenize a pair of sentences for a next-sentence prediction style task
encoded_input = tokenizer(
    sentence1,
    sentence2,
    padding=True,
    truncation=True,
    return_tensors='pt'
)
print("Token IDs:")
print(encoded_input['input_ids'])
print("\nAttention Mask:")
print(encoded_input['attention_mask'])
print("\nDecoded Tokens:")
# We can decode to see the special tokens added
print(
    tokenizer.convert_ids_to_tokens(encoded_input['input_ids'][0])
)
# Accessing special token IDs directly
print(f"\n[CLS] ID: {tokenizer.cls_token_id}")
print(f"[SEP] ID: {tokenizer.sep_token_id}")
print(f"[PAD] ID: {tokenizer.pad_token_id}")
print(f"[UNK] ID: {tokenizer.unk_token_id}")
print(f"[MASK] ID: {tokenizer.mask_token_id}")
Executing this code shows how the tokenizer automatically adds [CLS] at the beginning and [SEP] between the sentences and at the end. No padding appears here because the sentence pair is encoded as a single sequence and there is only one example in the batch; with multiple examples of different lengths, the shorter ones would be padded. The attention_mask correctly identifies the non-padding tokens (all 1s in this case).
Token IDs:
tensor([[ 101, 2023, 2003, 6251, 2028, 1012, 102, 2023, 2003, 2178,
6251, 1012, 102]])
Attention Mask:
tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
Decoded Tokens:
['[CLS]', 'this', 'is', 'sentence', 'one', '.', '[SEP]', 'this', 'is', 'another', 'sentence', '.', '[SEP]']
[CLS] ID: 101
[SEP] ID: 102
[PAD] ID: 0
[UNK] ID: 100
[MASK] ID: 103
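If a task needs markers beyond the pre-trained set, the same machinery lets you register new ones. The sketch below is illustrative: the '<note>' token is a made-up example, and the model's embedding matrix must be resized so the new ID has a corresponding row.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')

# Register an additional special token ('<note>' is purely illustrative)
num_added = tokenizer.add_special_tokens({'additional_special_tokens': ['<note>']})
print(f"Added {num_added} token(s); new vocabulary size: {len(tokenizer)}")

# Resize the embedding matrix to cover the new vocabulary entry
model.resize_token_embeddings(len(tokenizer))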
Understanding and correctly managing these special tokens is fundamental. They provide the necessary structural cues for the model to interpret the input sequences correctly for various pre-training and downstream tasks. The specific set of special tokens and their usage patterns often depend on the model architecture and the objectives it was trained on. Always consult the documentation for the specific model and tokenizer you are using.