Precision@k measures how many relevant items appear in top recommendations, but it has a significant blind spot: it ignores the order of those items. According to Precision@5, a recommendation list that places a relevant item at position 1 is scored identically to one that places it at position 5. This is not ideal in most applications. Users are far more likely to interact with items at the top of a list, so a good recommender should be rewarded for ranking relevant items higher.
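To see the blind spot concretely, here is a minimal sketch (the precision_at_k helper and the movie lists are illustrative, not taken from any library): both lists contain exactly one relevant movie among their top 5, so Precision@5 scores them identically even though one list buries the relevant item at the bottom.

def precision_at_k(recommended_items, relevant_items, k):
    """Fraction of the top-k recommended items that are relevant."""
    top_k = recommended_items[:k]
    return sum(1 for item in top_k if item in relevant_items) / k

relevant = {'Movie A'}
list_top = ['Movie A', 'Movie X', 'Movie Y', 'Movie Z', 'Movie W']     # relevant item first
list_bottom = ['Movie X', 'Movie Y', 'Movie Z', 'Movie W', 'Movie A']  # relevant item last

print(precision_at_k(list_top, relevant, 5))     # 0.2
print(precision_at_k(list_bottom, relevant, 5))  # 0.2, same score despite the worse ranking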
This is precisely the problem that Mean Average Precision (MAP) is designed to address. It is a ranking metric that heavily penalizes models for placing relevant items further down the recommendation list.
To understand MAP, we first need to look at Average Precision (AP), which is calculated on a per-user basis. Average Precision for a single user is the average of the Precision@k values computed at each position k that contains a relevant item.
Let's walk through an example. Suppose our model generates a ranked list of 6 movie recommendations for a user. We also know the ground truth, which is the set of movies this user actually watched and liked from our test set.
Recommended list (Model 1): [Movie C, Movie A, Movie F, Movie B, Movie H, Movie D]

Ground truth (relevant movies): {Movie A, Movie B, Movie D}

Now, we iterate through the recommended list and calculate the precision only at the positions where we find a relevant movie:

- Position 2 (Movie A): 1 relevant item among the top 2, so Precision@2 = 1/2 = 0.5
- Position 4 (Movie B): 2 relevant items among the top 4, so Precision@4 = 2/4 = 0.5
- Position 6 (Movie D): 3 relevant items among the top 6, so Precision@6 = 3/6 = 0.5
To get the Average Precision for this user, we average these precision scores. Since there were 3 relevant items in total, we divide the sum of our calculated precisions by 3.
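For Model 1, this gives:

$$AP_{\text{Model 1}} = \frac{0.5 + 0.5 + 0.5}{3} = 0.5$$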
Now, consider a better model that ranks the same relevant items higher.
Recommended list (Model 2): [Movie A, Movie B, Movie C, Movie F, Movie D, Movie H]

Ground truth (relevant movies): {Movie A, Movie B, Movie D}

Let's calculate the AP for this new list:

- Position 1 (Movie A): 1 relevant item among the top 1, so Precision@1 = 1/1 = 1.0
- Position 2 (Movie B): 2 relevant items among the top 2, so Precision@2 = 2/2 = 1.0
- Position 5 (Movie D): 3 relevant items among the top 5, so Precision@5 = 3/5 = 0.6
Now, we find the average:
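$$AP_{\text{Model 2}} = \frac{1.0 + 1.0 + 0.6}{3} \approx 0.867$$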
The AP score for Model 2 is significantly higher, correctly reflecting that it produced a better-ordered list of recommendations.
The following diagram illustrates the calculation for both models, showing how higher ranks for relevant items lead to a better AP score.
Comparing Average Precision for two different recommendation models for the same user. Model 2 achieves a higher score because it ranks the relevant items (A, B, and D) higher on the list.
The formal definition for Average Precision is:

$$AP = \frac{1}{|R|} \sum_{k=1}^{N} P(k) \cdot rel(k)$$

Where:

- $|R|$ is the total number of relevant items for the user.
- $N$ is the number of items in the recommendation list.
- $P(k)$ is the precision computed at cutoff $k$.
- $rel(k)$ is an indicator function equal to 1 if the item at position $k$ is relevant and 0 otherwise.
Average Precision gives us a score for a single user. To get a single metric that describes the performance of our entire model, we calculate the AP for every user in our test set and then take the average of all these scores. This final value is the Mean Average Precision (MAP).
$$MAP = \frac{1}{|U|} \sum_{u=1}^{|U|} AP_u$$

Where:

- $|U|$ is the number of users in the test set.
- $AP_u$ is the Average Precision computed for user $u$.
A MAP score ranges from 0 to 1, where a higher value indicates a better model. A score of 1.0 would mean that the model perfectly ranked all relevant items at the very top of the list for every single user.
Let's translate the logic into a simple Python function. This function takes a list of recommended items and a set of relevant items and computes the AP.
import numpy as np

def average_precision(recommended_items, relevant_items):
    """
    Calculates the Average Precision (AP) for a single recommendation list.

    Args:
        recommended_items (list): A ranked list of recommended item IDs.
        relevant_items (set): A set of relevant item IDs (ground truth).

    Returns:
        float: The Average Precision score.
    """
    if not relevant_items:
        return 0.0

    # Store the precision value at each position that holds a relevant item
    precision_scores = []
    num_hits = 0

    for i, item_id in enumerate(recommended_items):
        if item_id in relevant_items:
            num_hits += 1
            precision_at_k = num_hits / (i + 1)  # precision at position i + 1
            precision_scores.append(precision_at_k)

    if not precision_scores:
        return 0.0

    # AP: sum of precisions at the relevant positions, divided by the
    # total number of relevant items (|R| in the definition above)
    return np.sum(precision_scores) / len(relevant_items)


# Example from Model 2: hits at positions 1, 2, and 5
recommended = ['Movie A', 'Movie B', 'Movie C', 'Movie F', 'Movie D', 'Movie H']
relevant = {'Movie A', 'Movie B', 'Movie D'}

ap_score = average_precision(recommended, relevant)
print(f"AP Score for Model 2: {ap_score:.4f}")
# Expected output: AP Score for Model 2: 0.8667
To get the MAP score for your system, you would run this function for each user in your test set and then compute the mean of all the returned AP scores. MAP is a standard and effective metric for any recommendation task where the ranking of items is a primary concern.
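As a minimal sketch, here is how that aggregation might look. It reuses the average_precision function defined above; the mean_average_precision name and the two dictionaries (user ID to ranked recommendations, user ID to ground-truth sets) are illustrative assumptions about how your test data is organized.

def mean_average_precision(recommendations_by_user, relevant_by_user):
    """
    Computes MAP across all users in the test set.

    Args:
        recommendations_by_user (dict): Maps user ID to a ranked list of recommended item IDs.
        relevant_by_user (dict): Maps user ID to a set of relevant item IDs (ground truth).

    Returns:
        float: The Mean Average Precision score.
    """
    ap_scores = [
        average_precision(recommendations_by_user[user], relevant_by_user[user])
        for user in relevant_by_user
    ]
    return sum(ap_scores) / len(ap_scores) if ap_scores else 0.0

# Hypothetical test set with two users: one gets Model 2's list, the other Model 1's
recommendations_by_user = {
    'user_1': ['Movie A', 'Movie B', 'Movie C', 'Movie F', 'Movie D', 'Movie H'],
    'user_2': ['Movie C', 'Movie A', 'Movie F', 'Movie B', 'Movie H', 'Movie D'],
}
relevant_by_user = {
    'user_1': {'Movie A', 'Movie B', 'Movie D'},
    'user_2': {'Movie A', 'Movie B', 'Movie D'},
}

map_score = mean_average_precision(recommendations_by_user, relevant_by_user)
print(f"MAP: {map_score:.4f}")
# Expected output: MAP: 0.6833  ->  (0.8667 + 0.5) / 2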