Commit 88c9468 (parent 9a44fc3)

Niche topic centroid targeting blog

I plan to generate a graphic for this post from real embedding classification clusters just to illustrate its effectiveness.

1 file changed: +113 lines
Title: Contextual Topic Targeting with Embedding Centroids
Date: August 5, 2025
description: This post shows how we improved our contextual targeting to handle hundreds of developer-specific topic niches with embeddings, pgvector, and centroids.
tags: content-targeting, engineering, postgresql
authors: David Fischer
image: /images/posts/niche-targeting.png

Going back to our [original vision for EthicalAds](../pages/vision.md),
our goal has always been to show the best ad possible based on page context rather than users' private details.
Delivering the best possible ad on a given page
results in strong advertiser performance and high earnings for the sites
where the ads appear, all without compromising privacy.

However, our approach to best fulfill that vision has changed over time.
The tools available to target contextually are rapidly improving
with the advances in large language models (LLMs).
This post delves into how to use those advances for ad targeting,
but similar approaches can be used for many types of classifiers.

## Historical context and scaling topic classification

A few years back, we built [our first topic classifier](https://www.ethicalads.io/blog/2022/11/a-new-approach-to-content-based-targeting-for-advertising/)
that essentially bundled content and keywords together into topics that advertisers could target and buy.
To give a few examples, in addition to our [core audiences](../pages/advertisers.md#audiences),
this allowed advertisers to target database-related or blockchain-related content with relevant ads.
This approach scaled well up to about 15-20 topics, which was great for ad performance.
However, adding another topic to target involved not just adding training set examples for that topic
but also making sure any of our existing examples that also applied to the new topic were marked appropriately.
Scaling became a pain.

Last year, we built a more advanced way of targeting very specific content with language model embeddings
that we called [niche targeting](../pages/niche-targeting.md)
(see our [blog post](../posts/2024-niche-ad-targeting.md) for more details).
This approach worked by targeting pages similar to an advertiser's specific landing and product pages.
Using this approach, we saw ad performance that was 25-30% better in most cases.
However, campaign sizes were very limited because there just weren't enough very similar pages,
and it was hard to fill the campaign sizes advertisers wanted to run.
It was also simply harder to explain to marketers how this worked, which made it harder to sell despite the strong performance.

## Hybrid approach with embedding centroids

After generating embeddings for nearly a million pages across our network,
clusters of related content started to emerge.
Think of Kubernetes-related content clustering together
and Python-related content clustering together in a different region of the embedding space.
A centroid is simply the average of these embeddings: a single vector that represents the center of that topic cluster.

New content that's semantically similar will automatically fall close to related content in the embedding space.
Just as before with our topic classifier model, this lets us sell advertisers on the topic they're looking for.
But unlike the previous approach, you only need to classify a few tens of pages of content for a new centroid to start taking shape.
This scales much better to hundreds of topics or more.
It's also far easier to explain to advertisers that we are targeting content related to the right topic for their product.
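
Before any database code, the underlying math is worth seeing on its own. Here's a minimal sketch in plain Python with tiny hypothetical 3-dimensional embeddings (real embedding models produce hundreds or thousands of dimensions):

```python
# Hypothetical 3-dimensional embeddings of pages classified as "devops".
devops_embeddings = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.7, 0.0, 0.2],
]

# The centroid is just the element-wise mean of the vectors.
dims = len(devops_embeddings[0])
centroid = [
    sum(vec[i] for vec in devops_embeddings) / len(devops_embeddings)
    for i in range(dims)
]
print(centroid)  # approximately [0.8, 0.1, 0.1]
```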

To make this concrete, here's an example of generating a centroid from a set of manually classified embeddings with [pgvector](https://github.com/pgvector/pgvector-python) and Django:

```python
from pgvector.django import Avg

# Human classifications that match an embedding to a topic
from .models import TopicEmbedding

# Get the centroid for DevOps related content:
# the element-wise average of every embedding classified as "devops"
centroid = TopicEmbedding.objects.filter(topic="devops").aggregate(
    centroid=Avg("embedding__vector"),
)["centroid"]
```

When classifying new content (a new embedding), it's easy to see how similar it is to all of the topic centroids.
This essentially answers the question of "how DevOps-ey is this content" or "how Frontend-ey is this content"
for all possible topics.

```python
from pgvector.django import CosineDistance

from .models import TopicCentroid

# New embedding to classify
vector = [-1.457664e-02, 3.473443e-02, ...]

# Closer than this threshold implies the content is related.
# This threshold differs based on your embedding model.
distance_threshold = 0.45

# Topic centroids ordered from most to least similar
TopicCentroid.objects.annotate(
    distance=CosineDistance("vector", vector)
).filter(distance__lte=distance_threshold).order_by("distance")
```
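
For intuition about what cosine distance computes, here's the same comparison in plain Python with hypothetical toy centroids. Cosine distance is 1 minus cosine similarity, which is what pgvector's `<=>` operator returns:

```python
import math

def cosine_distance(a, b):
    # 1 - cosine similarity, matching pgvector's <=> operator
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (norm_a * norm_b)

# Hypothetical toy centroids and a new page embedding
centroids = {
    "devops": [0.8, 0.1, 0.1],
    "frontend": [0.1, 0.9, 0.2],
}
new_page = [0.7, 0.2, 0.1]

distance_threshold = 0.45
matches = sorted(
    (topic for topic, c in centroids.items()
     if cosine_distance(new_page, c) <= distance_threshold),
    key=lambda topic: cosine_distance(new_page, centroids[topic]),
)
print(matches)  # ['devops'] — frontend is past the threshold
```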

This approach yields all the benefits of using embeddings, like much better semantic relevance than simple keywords,
while still being explainable like the simple keyword targeting used in search ads.
It also scales perfectly well with any number of topics,
and new content just gets an embedding and gets matched and clustered automatically.
As more content is manually classified and added to the centroid, the centroid better reflects that topic
and classifications for that topic improve over time.
Adding new topics for classification only requires manually classifying a small set of example pages to seed a new centroid.

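One reason the centroid improves cheaply over time: folding in a newly classified embedding is just a running-mean update, so the centroid doesn't have to be recomputed from every stored vector. This is a hypothetical sketch (simply re-running the database aggregate also works):

```python
def update_centroid(centroid, count, new_embedding):
    """Running-mean update: fold one new embedding into an existing centroid."""
    new_count = count + 1
    updated = [
        c + (x - c) / new_count
        for c, x in zip(centroid, new_embedding)
    ]
    return updated, new_count

# A centroid built from 3 embeddings, plus one newly classified page
centroid, n = [0.8, 0.1, 0.1], 3
centroid, n = update_centroid(centroid, n, [0.6, 0.3, 0.1])
print(centroid)  # approximately [0.75, 0.15, 0.1]
```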
## Conclusion

From the moment we started using embeddings for ad targeting,
we recognized they had great potential for improving contextual targeting performance for advertisers.
Better ad performance means we can generate more money for the sites that host our ads,
which is a great virtuous cycle.

With this approach using centroids, we hope to have another virtuous cycle
where our classifications improve over time as we classify more content.

Simon Willison's [blog post on embeddings](https://simonwillison.net/2023/Oct/23/embeddings/)
as well as some of his other posts and presentations have been very influential
in honing our approach. Thanks Simon!
