Title: Contextual Topic Targeting with Embedding Centroids
Date: August 5, 2025
description: This post shows how we improved our contextual targeting to handle hundreds of developer-specific topic niches with embeddings, pgvector, and centroids.
tags: content-targeting, engineering, postgresql
authors: David Fischer
image: /images/posts/niche-targeting.png

Going back to our [original vision for EthicalAds](../pages/vision.md),
our goal has always been to show the best possible ad based on page context rather than users' private details.
Delivering the best ad for a given page means strong advertiser performance and high earnings for the sites
where the ads appear, all without compromising privacy.

However, our approach to fulfilling that vision has changed over time.
The tools available for contextual targeting are improving rapidly
with advances in large language models (LLMs).
This post delves into how we use those advances for ad targeting,
but similar approaches can be used for many types of classifiers.


## Historical context and scaling topic classification

A few years back, we built [our first topic classifier](https://www.ethicalads.io/blog/2022/11/a-new-approach-to-content-based-targeting-for-advertising/)
that essentially bundled content and keywords together into topics that advertisers could target and buy.
To give a few examples, in addition to our [core audiences](../pages/advertisers.md#audiences),
this allowed advertisers to target database-related or blockchain-related content with relevant ads.
This approach scaled well up to about 15-20 topics and was great for ad performance.
However, adding a new topic to target involved not just adding training set examples for that topic
but also making sure any existing examples that applied to the new topic were marked appropriately.
Scaling became a pain.

Last year, we built a more advanced way of targeting very specific content with language model embeddings
that we called [niche targeting](../pages/niche-targeting.md)
(see our [blog post](../posts/2024-niche-ad-targeting.md) for more details).
This approach targeted pages similar to an advertiser's specific landing and product pages.
Using it, we saw ad performance improve 25-30% in most cases.
However, campaign sizes were very limited: there simply aren't enough closely similar pages,
and it was hard to fill the campaign sizes advertisers wanted to run.
It was also harder to explain to marketers how this worked, which made it harder to sell despite the strong performance.


## Hybrid approach with embedding centroids

After generating embeddings for nearly a million pages across our network,
clusters of related content started to emerge.
Think of Kubernetes-related content clustering together
and Python-related content clustering together in a different part of the embedding space.
A centroid is simply the average of these embeddings: a single vector that represents the center of that topic cluster.

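To illustrate the idea outside the database, here's a minimal sketch of a centroid and a distance check with NumPy. The embedding values and dimensionality are made up for the example; real embeddings have hundreds or thousands of dimensions:

```python
import numpy as np

# Hypothetical 3-dimensional embeddings for pages about the same topic
embeddings = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.7, 0.0, 0.2],
])

# The centroid is just the element-wise average of the cluster
centroid = embeddings.mean(axis=0)  # array([0.8, 0.1, 0.1])

def cosine_distance(a, b):
    """Cosine distance: 0 means identical direction, 2 means opposite."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# A new page on the same topic lands close to the centroid
new_page = np.array([0.85, 0.15, 0.05])
print(cosine_distance(centroid, new_page))
```
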
New content that's semantically similar will automatically fall close to related content in the embedding space.
Just as with our earlier topic classifier, this lets us sell advertisers the topic they're looking for.
But unlike the previous approach, you only need to classify a few dozen pages for a new centroid to start taking shape.
This scales much better to hundreds of topics or more.
It's also far easier to explain to advertisers that we are targeting content related to the right topic for their product.

To make this concrete, here's an example of generating a centroid from a set of manually classified embeddings with [pgvector](https://github.com/pgvector/pgvector-python) and Django:

```python
from django.db import models

# Human classifications that match an embedding to a topic
from .models import TopicEmbedding

# Get the centroid for DevOps-related content by averaging
# the vectors of every manually classified embedding
centroid = TopicEmbedding.objects.filter(topic="devops").aggregate(
    centroid=models.Avg("embedding__vector"),
)["centroid"]
```

When classifying new content (a new embedding), it's easy to see how similar it is to each of the topic centroids.
This essentially answers the question "how DevOps-ey is this content?" or "how frontend-ey is this content?"
for all possible topics.

```python
from pgvector.django import CosineDistance

from .models import TopicCentroid

# New embedding to classify
vector = [-1.457664e-02, 3.473443e-02, ...]

# Closer than this threshold implies the content is related.
# This threshold differs based on your embedding model.
distance_threshold = 0.45

# All topics this content relates to, most closely related first
matching_topics = TopicCentroid.objects.annotate(
    distance=CosineDistance("vector", vector)
).filter(distance__lte=distance_threshold).order_by("distance")
```

This approach yields the benefits of using embeddings, like much better semantic relevance than simple keywords,
while still being as explainable as the keyword targeting used in search ads.
It also scales well to any number of topics,
and new content just gets an embedding and is matched and clustered automatically.
As more content is manually classified and added to a centroid, the centroid better reflects its topic,
and classifications for that topic improve over time.
Adding a new topic only requires manually classifying a small set of example pages for it.


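One way a centroid can keep improving as pages are classified, without re-reading every stored embedding, is an incremental (running) mean. This is a sketch under assumed names, not code from our codebase:

```python
import numpy as np

def update_centroid(centroid, count, new_embeddings):
    """Fold newly classified embeddings into an existing centroid.

    Treats the centroid as a running mean: the old centroid is weighted
    by how many embeddings produced it, then averaged with the new ones.
    """
    centroid = np.asarray(centroid, dtype=float)
    new = np.asarray(new_embeddings, dtype=float)
    total = count + len(new)
    updated = (centroid * count + new.sum(axis=0)) / total
    return updated, total

# Example: a centroid built from 2 pages, then 2 more classified pages arrive
centroid, count = update_centroid([1.0, 0.0], 2, [[0.0, 1.0], [1.0, 1.0]])
```

The same result could be had by re-running the aggregate query above; the incremental form just avoids touching every row when only a handful of new classifications arrive.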
## Conclusion

From the moment we started using embeddings for ad targeting,
we recognized their potential for improving contextual targeting performance for advertisers.
Better ad performance means we generate more money for the sites that host our ads,
which is a great virtuous cycle.

With this centroid approach, we hope to create another virtuous cycle
where our classifications improve over time as we classify more content.

Simon Willison's [blog post on embeddings](https://simonwillison.net/2023/Oct/23/embeddings/),
as well as some of his other posts and presentations, has been very influential
in honing our approach. Thanks Simon!