Creating Item Profiles from Metadata

The core of content-based filtering lies in its ability to understand and quantify what an item is. Before we can recommend a movie similar to Avatar, we first need a structured way to describe Avatar's characteristics. This is the purpose of an item profile: to translate an item's properties, or metadata, into a consistent format that a machine learning model can process. An item profile is the digital fingerprint of an item, capturing its most important attributes.

From Metadata to Machine-Readable Features

Most datasets contain rich metadata that can be used to build these profiles. For a movie dataset, this information might include columns for genres, director, cast, and a plot summary. For an e-commerce product, it could be the brand, category, color, and technical specifications. Our first task is to select the most relevant pieces of metadata and combine them into a single, unified representation for each item.

For example, consider a simplified dataset of movies:

Movie ID	Title	Genres	Director	Keywords
1	Avatar	Action, Adventure, Sci-Fi	James Cameron	alien, planet, marine
2	Titanic	Drama, Romance	James Cameron	ship, iceberg, artist
3	The Dark Knight	Action, Drama, Thriller	Christopher Nolan	gotham, joker, batman

The raw metadata in the Genres, Director, and Keywords columns is our starting point, but in its current form it is not ideal for direct comparison. If we split the text on whitespace, the directors "James Cameron" and "James Gunn" would share the token "james" and appear partially similar even though they are different people. A common preprocessing step is therefore to merge multi-word names into single tokens (e.g., "James Cameron" becomes "JamesCameron"). We should also convert all text to a consistent case, usually lowercase, so that "Action" and "action" are treated as the same feature.
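The effect of these two preprocessing steps can be seen in a small sketch. The helper names `naive_tokens` and `merged_token` are hypothetical and chosen only for illustration; they are not part of any library.

```python
def naive_tokens(name):
    """Lowercase and split on whitespace, without merging the words."""
    return set(name.lower().split())

def merged_token(name):
    """Lowercase and remove all whitespace so the full name is one token."""
    return "".join(name.lower().split())

cameron, gunn = "James Cameron", "James Gunn"

# Without merging, the two directors share the token "james".
print(naive_tokens(cameron) & naive_tokens(gunn))   # {'james'}

# With merging, each director becomes a distinct token.
print(merged_token(cameron), merged_token(gunn))    # jamescameron jamesgunn

# Lowercasing makes differently cased spellings identical.
print(merged_token("Action") == merged_token("action"))  # True
```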

Creating a Unified Feature String

A straightforward and effective technique for building an item profile is to combine all the processed textual and categorical features into a single string, sometimes called a "metadata soup." This string serves as a comprehensive description of the item.

Let's apply this to our movie data. We'll take the values from the Genres, Director, and Keywords columns, clean them up (lowercase everything, drop the separating commas, and remove the space inside the director's name), and concatenate them.

import pandas as pd

# Sample movie data
data = {
    'movie_id': [1, 2, 3],
    'title': ['Avatar', 'Titanic', 'The Dark Knight'],
    'genres': ['Action, Adventure, Sci-Fi', 'Drama, Romance', 'Action, Drama, Thriller'],
    'director': ['James Cameron', 'James Cameron', 'Christopher Nolan'],
    'keywords': ['alien, planet, marine', 'ship, iceberg, artist', 'gotham, joker, batman']
}
movies_df = pd.DataFrame(data)

# Function to clean and combine features
def create_feature_soup(row):
    # Lowercase every field and drop the commas between genres and keywords.
    # Spaces inside the director's name are removed so it becomes one token.
    genres = ' '.join(row['genres'].replace(', ', ' ').lower().split())
    director = row['director'].replace(' ', '').lower()
    keywords = ' '.join(row['keywords'].replace(', ', ' ').lower().split())
    
    return f"{genres} {director} {keywords}"

# Apply the function to create a new column with the combined features
movies_df['profile_features'] = movies_df.apply(create_feature_soup, axis=1)

print(movies_df[['title', 'profile_features']])

The resulting DataFrame would look like this:

title	profile_features
Avatar	action adventure sci-fi jamescameron alien planet marine
Titanic	drama romance jamescameron ship iceberg artist
The Dark Knight	action drama thriller christophernolan gotham joker batman
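The sample data contains only single-word genres and keywords. If a dataset had multi-word values (for instance a genre such as "Science Fiction") or missing metadata, a slightly more defensive variant might be useful. The following is a minimal sketch under those assumptions; `normalize_list` and `create_feature_soup_robust` are hypothetical names, and the function accepts any row-like object with a `.get` method, such as a plain dict or a pandas Series.

```python
def normalize_list(value):
    """Turn 'Science Fiction, Action' into ['sciencefiction', 'action']."""
    if not isinstance(value, str):  # e.g. None or NaN for missing metadata
        return []
    tokens = []
    for part in value.split(","):
        token = "".join(part.lower().split())  # lowercase, remove inner spaces
        if token:
            tokens.append(token)
    return tokens

def create_feature_soup_robust(row, columns=("genres", "director", "keywords")):
    """Concatenate the normalized tokens of the chosen columns into one string."""
    tokens = []
    for col in columns:
        tokens.extend(normalize_list(row.get(col)))
    return " ".join(tokens)

# A hypothetical row with a multi-word genre and no keywords.
example_row = {
    "genres": "Action, Science Fiction",
    "director": "James Cameron",
    "keywords": None,
}
print(create_feature_soup_robust(example_row))
# action sciencefiction jamescameron
```

On the three sample movies this variant produces the same profiles as the table above; it could be applied with `movies_df.apply(create_feature_soup_robust, axis=1)`.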

Each entry in the profile_features column is now a standardized, descriptive profile for a movie. The process is illustrated in the diagram below.

The transformation of raw, disparate metadata into a single, clean feature string for each item.
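One informal way to sanity-check the profiles is to look at which tokens two movies have in common. The sketch below hard-codes the three profiles from the table above and uses a hypothetical helper, `shared_tokens`; it is only a quick inspection aid, not a similarity measure.

```python
profiles = {
    "Avatar": "action adventure sci-fi jamescameron alien planet marine",
    "Titanic": "drama romance jamescameron ship iceberg artist",
    "The Dark Knight": "action drama thriller christophernolan gotham joker batman",
}

def shared_tokens(title_a, title_b):
    """Return the sorted tokens that appear in both movies' profiles."""
    tokens_a = set(profiles[title_a].split())
    tokens_b = set(profiles[title_b].split())
    return sorted(tokens_a & tokens_b)

print(shared_tokens("Avatar", "Titanic"))           # ['jamescameron']
print(shared_tokens("Avatar", "The Dark Knight"))   # ['action']
print(shared_tokens("Titanic", "The Dark Knight"))  # ['drama']
```

Because the director's name was merged into one token, Avatar and Titanic overlap on `jamescameron` as a whole rather than on a generic first name.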

By creating these unified profiles, we have represented each item's characteristics in a consistent textual format, which gives us a solid foundation for comparison. However, a string of words cannot be plugged directly into a mathematical formula such as cosine similarity. The next step is to convert these textual profiles into numerical vectors, a task well suited to the TF-IDF technique, which we will cover in the following section.

