Having explored the theory behind word embeddings like Word2Vec and GloVe, let's put these concepts into practice. In this section, we'll use the popular gensim library to train our own Word2Vec model on a small dataset, explore the resulting word vectors, and learn how to load and utilize powerful pre-trained embedding models. This hands-on experience will solidify your understanding of how distributional semantics translates into practical vector representations.
First, ensure you have the necessary libraries installed. We'll primarily use gensim for Word2Vec, nltk for sample data and basic tokenization, and potentially scikit-learn and plotly for visualization.
pip install gensim nltk scikit-learn plotly
You might also need to download specific nltk resources if you haven't already:
import nltk

# Download the required nltk resources if they are not already available
for resource, path in [('brown', 'corpora/brown'), ('punkt', 'tokenizers/punkt')]:
    try:
        nltk.data.find(path)
    except LookupError:  # nltk.data.find raises LookupError when a resource is missing
        nltk.download(resource)
Word2Vec learns from sequences of words (sentences or documents). Unlike TF-IDF, which often benefits from aggressive preprocessing like stop word removal and stemming, Word2Vec generally works better with less manipulation. Basic cleaning like lowercasing and tokenization is usually sufficient. The surrounding context words, including stop words, provide valuable information for learning embeddings.
Let's use the Brown Corpus from nltk as our sample dataset and perform minimal preprocessing:
import nltk
from nltk.corpus import brown
import string
# Load sentences from the Brown Corpus
# Each sentence is already a list of words/tokens in this corpus
raw_sentences = brown.sents()
# Preprocess: lowercase and remove punctuation
processed_sentences = []
for sentence in raw_sentences:
    processed_sentence = [word.lower() for word in sentence if word not in string.punctuation]
    # Ensure the sentence is not empty after removing punctuation
    if processed_sentence:
        processed_sentences.append(processed_sentence)
print(f"Loaded and processed {len(processed_sentences)} sentences.")
# Example of a processed sentence
print("Example processed sentence:", processed_sentences[10])
This gives us a list of lists, where each inner list contains the tokens of a sentence. This is the format gensim expects.
Now, let's train a Word2Vec model using gensim. We need to specify several hyperparameters:
- sentences: The input data (our processed_sentences).
- vector_size: The dimensionality of the word vectors (e.g., 100, 300). Higher dimensions can capture more complex relationships but require more data and computation.
- window: The maximum distance between the current and predicted word within a sentence.
- min_count: Ignores all words with a total frequency lower than this. Helps filter out rare words and typos.
- workers: Number of CPU cores to use for training (parallelization).
- sg: Training algorithm. 0 for CBOW (Continuous Bag-of-Words), 1 for Skip-gram. Skip-gram often works better for infrequent words, while CBOW is faster.
- epochs: Number of iterations (epochs) over the corpus.

from gensim.models import Word2Vec
import multiprocessing # To find the number of cores
# Define model parameters
vector_dim = 100 # Dimensionality of the embeddings
window_size = 5 # Context window size
min_word_count = 5 # Minimum word frequency
training_algorithm = 1 # 1 for Skip-gram, 0 for CBOW
num_workers = multiprocessing.cpu_count() # Use all available cores
training_epochs = 10 # Number of training iterations
print("Training Word2Vec model...")
# Initialize and train the model
# Note: Training can take a few minutes depending on your data size and CPU
model = Word2Vec(sentences=processed_sentences,
                 vector_size=vector_dim,
                 window=window_size,
                 min_count=min_word_count,
                 sg=training_algorithm,
                 workers=num_workers,
                 epochs=training_epochs)
print("Model training complete.")
# You can save the trained model for later use
# model.save("brown_word2vec.model")
# To load: model = Word2Vec.load("brown_word2vec.model")
Once the model is trained, we can investigate the learned representations. The model.wv attribute holds the vocabulary and vectors.
# Access the vector for a specific word
try:
    vector_king = model.wv['king']
    print(f"Vector for 'king':\n {vector_king[:10]}...")  # Print the first 10 dimensions
    print(f"Shape of 'king' vector: {vector_king.shape}")
except KeyError:
    print("'king' not in vocabulary (likely due to min_count or not present in corpus).")
# Find words most similar to a given word
try:
    similar_to_woman = model.wv.most_similar('woman', topn=5)
    print("\nWords most similar to 'woman':")
    for word, score in similar_to_woman:
        print(f"- {word}: {score:.4f}")
except KeyError:
    print("'woman' not in vocabulary.")
# Explore word analogies: king - man + woman = ?
try:
    analogy_result = model.wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)
    print(f"\nAnalogy 'king' - 'man' + 'woman' ≈ {analogy_result[0][0]} (Score: {analogy_result[0][1]:.4f})")
except KeyError as e:
    print(f"\nCould not perform analogy: {e}")
# Check if a word is in the vocabulary
print(f"\nIs 'government' in vocabulary? {'government' in model.wv.key_to_index}")
print(f"Vocabulary size: {len(model.wv.key_to_index)}")
The results, especially for analogies, depend heavily on the size and nature of the training data and the chosen hyperparameters. Our model trained only on the Brown Corpus might not capture analogies as well as models trained on gigabytes of text.
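To see this sensitivity for yourself, you can retrain with a different configuration and compare nearest neighbors for a few probe words. The snippet below is a small sketch, not part of the original walkthrough: it trains a CBOW model (sg=0) on the same sentences with otherwise identical settings and prints both models' neighbors side by side. The probe words are arbitrary choices, and the exact neighbors will vary from run to run.

from gensim.models import Word2Vec

# Train a second model with CBOW (sg=0) but otherwise the same settings
cbow_model = Word2Vec(sentences=processed_sentences,
                      vector_size=vector_dim,
                      window=window_size,
                      min_count=min_word_count,
                      sg=0,  # CBOW instead of Skip-gram
                      workers=num_workers,
                      epochs=training_epochs)

# Compare nearest neighbors for probe words present in both vocabularies
for probe in ['money', 'war', 'school']:
    if probe in model.wv.key_to_index and probe in cbow_model.wv.key_to_index:
        sg_neighbors = [w for w, _ in model.wv.most_similar(probe, topn=5)]
        cbow_neighbors = [w for w, _ in cbow_model.wv.most_similar(probe, topn=5)]
        print(f"{probe:>8} | Skip-gram: {sg_neighbors}")
        print(f"{'':>8} | CBOW:      {cbow_neighbors}")

On a corpus this small, both neighbor lists are noisy, so treat any differences as illustrative rather than as evidence that one algorithm is better than the other.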
Word vectors live in a high-dimensional space (100 dimensions in our example). To visualize them, we need to reduce their dimensionality to 2D or 3D. Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE) are common techniques for this. t-SNE is often preferred for visualizing local structure and clusters.
Let's visualize a subset of our learned vectors using PCA and Plotly.
import numpy as np
from sklearn.decomposition import PCA
import plotly.graph_objects as go
# Select a subset of words for visualization
words_to_visualize = ['man', 'woman', 'king', 'queen', 'boy', 'girl',
                      'father', 'mother', 'son', 'daughter', 'uncle', 'aunt',
                      'dog', 'cat', 'animal', 'pet',
                      'house', 'home', 'car', 'road', 'city', 'country',
                      'love', 'hate', 'happy', 'sad']
# Get vectors for the selected words that are in the vocabulary
vectors = []
words = []
for word in words_to_visualize:
    if word in model.wv.key_to_index:
        vectors.append(model.wv[word])
        words.append(word)

if not vectors:
    print("None of the selected words for visualization are in the vocabulary.")
else:
    vectors = np.array(vectors)

    # Reduce dimensions using PCA
    pca = PCA(n_components=2)
    vectors_2d = pca.fit_transform(vectors)

    # Create an interactive scatter plot with Plotly
    fig = go.Figure(data=go.Scatter(
        x=vectors_2d[:, 0],
        y=vectors_2d[:, 1],
        mode='markers+text',
        marker=dict(
            size=8,
            color='#228be6'  # Blue marker color
        ),
        text=words,
        textposition='top center'
    ))

    fig.update_layout(
        title='Word Embeddings Visualized using PCA (2D)',
        xaxis_title='PCA Component 1',
        yaxis_title='PCA Component 2',
        width=700,
        height=600,
        template='plotly_white'  # Use a clean template
    )

    # Display the plot (in environments like Jupyter)
    # fig.show()

    # Or generate the JSON for web embedding
    plotly_json = fig.to_json()
    print("\nPlotly JSON for visualization (first 500 chars):")
    print(plotly_json[:500] + "...")  # Print a snippet of the JSON

# In a web context, you would embed this JSON using Plotly.js
# Example (conceptual):
# ```plotly
# {"layout": {"title": {"text": "Word Embeddings Visualized using PCA (2D)"}, ...}, "data": [{"x": [...], "y": [...], ...}]}
# ```
PCA projection of word vectors trained on the Brown Corpus. Observe how related concepts like ('man', 'woman', 'boy', 'girl') or ('dog', 'cat', 'pet') tend to cluster together, demonstrating that the embeddings capture semantic relationships.
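Since t-SNE was mentioned above as the technique often preferred for visualizing local structure, here is a minimal sketch of how you might swap it in for PCA, assuming the vectors and words arrays from the example above are still in scope. t-SNE output depends on the perplexity setting and the random seed, and perplexity must be smaller than the number of points being plotted.

from sklearn.manifold import TSNE
import plotly.graph_objects as go

# Project the same word vectors with t-SNE (assumes `vectors` and `words` from above)
tsne = TSNE(n_components=2, perplexity=5, random_state=42, init='pca')
vectors_tsne = tsne.fit_transform(vectors)

fig_tsne = go.Figure(data=go.Scatter(
    x=vectors_tsne[:, 0],
    y=vectors_tsne[:, 1],
    mode='markers+text',
    text=words,
    textposition='top center'
))
fig_tsne.update_layout(title='Word Embeddings Visualized using t-SNE (2D)',
                       template='plotly_white', width=700, height=600)
# fig_tsne.show()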
Training embeddings requires significant computational resources and massive datasets to achieve high quality. Often, it's more practical to use pre-trained embeddings released by research institutions. These models are trained on web-scale corpora (like Google News or Wikipedia) and capture rich semantic relationships. gensim provides convenient access to several popular pre-trained models.
Let's load a smaller GloVe model pre-trained on Wikipedia. Other options include larger GloVe models or Word2Vec models like word2vec-google-news-300.
import gensim.downloader as api
# List available models (optional)
# print(list(api.info()['models'].keys()))
print("\nLoading pre-trained GloVe model (glove-wiki-gigaword-100)...")
# This will download the model if not present locally (can take time and disk space)
try:
    glove_model = api.load("glove-wiki-gigaword-100")  # 100-dimensional GloVe vectors
    print("Pre-trained GloVe model loaded.")

    # Now use it like our own model
    print("Vector shape:", glove_model['computer'].shape)

    print("\nWords similar to 'technology' (GloVe):")
    similar_tech = glove_model.most_similar('technology', topn=5)
    for word, score in similar_tech:
        print(f"- {word}: {score:.4f}")

    print("\nAnalogy 'king' - 'man' + 'woman' ≈ (GloVe):")
    analogy_result_glove = glove_model.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)
    print(f"- {analogy_result_glove[0][0]} (Score: {analogy_result_glove[0][1]:.4f})")
except Exception as e:
    print(f"Failed to load pre-trained model. Error: {e}")
    print("Check your internet connection or try a different model.")
You'll likely observe that the pre-trained model provides more intuitive similarity results and performs better on analogy tasks due to the vast amount of data it was trained on.
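To quantify that difference rather than just eyeballing a few examples, you can score both models on a standard analogy test set. The sketch below assumes gensim's bundled copy of questions-words.txt (shipped with its test utilities) and uses the evaluate_word_analogies method on each model's KeyedVectors; expect the Brown-trained model to score far lower and to skip many questions because of out-of-vocabulary words.

from gensim.test.utils import datapath

# Path to the analogy test set bundled with gensim's test data
analogy_path = datapath('questions-words.txt')

# evaluate_word_analogies returns an overall accuracy plus a per-section breakdown
own_score, _ = model.wv.evaluate_word_analogies(analogy_path)
glove_score, _ = glove_model.evaluate_word_analogies(analogy_path)

print(f"Brown-trained Word2Vec analogy accuracy: {own_score:.2%}")
print(f"Pre-trained GloVe analogy accuracy:      {glove_score:.2%}")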
This practical exercise demonstrated how to train your own Word2Vec model and, perhaps more importantly for many applications, how to load and utilize powerful pre-trained embeddings. These dense vector representations are fundamental building blocks for many advanced NLP tasks, including the sequence models we will introduce in the next chapter. They provide a way to feed semantic understanding into machine learning algorithms.
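As a small illustration of that last point, one simple way to feed embeddings into a downstream model is to average the vectors of a document's in-vocabulary words into a single fixed-length feature vector, which can then be passed to any standard classifier. The helper below is a hypothetical sketch of that idea using the GloVe vectors loaded above; weighting schemes and the sequence models of the next chapter generally perform better.

import numpy as np

def average_embedding(tokens, keyed_vectors):
    """Mean-pool the vectors of in-vocabulary tokens; return zeros if none are known."""
    known = [keyed_vectors[t] for t in tokens if t in keyed_vectors.key_to_index]
    if not known:
        return np.zeros(keyed_vectors.vector_size)
    return np.mean(known, axis=0)

doc = "the government announced a new technology policy".split()
doc_vector = average_embedding(doc, glove_model)
print("Document vector shape:", doc_vector.shape)  # (100,) for glove-wiki-gigaword-100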