Sitemap improvements: lastmod, image entries, noindex on excluded pages #1427

Merged
andygrunwald merged 4 commits into main from andygrunwald/sitemap-improvements
May 5, 2026

@andygrunwald
Contributor

Summary

Four isolated commits on andygrunwald/sitemap-improvements, addressing the sitemap audit:

  1. 3bfe4136 — Emit <lastmod> for content-driven pages. New build-time helper scripts/sitemap-lastmod.mjs walks the relevant content directories synchronously, parses pubDate / date / publishedAt / latestEpisodePublished out of frontmatter or JSON, and exposes a Map<urlPath, ISO> for the sitemap serialize hook. Coverage: podcast episodes, blog posts, meetup events, and the matching index pages (/podcast/, /blog/, /meetup/<flavor>/, /deutsche-tech-podcasts/, /filme-fuer-softwareentwickler/, /spiele-fuer-softwareentwickler/, /). Tag pages and per-genre/category routes fall back to build time. 277 of 464 URLs now carry <lastmod>.

  2. ecf00e96 — Document the exclude list and fix the typo. Renamed exludeFromSitemap to excludeFromSitemap and added a comment block explaining each entry (meetup/<flavor>/promote/ for short-lived QR-code campaign pages, linktree/ for a redirect-only landing page). No behaviour change.

  3. 800158d5 — Mark sitemap-excluded pages as noindex,nofollow. Defence-in-depth: the exclude list keeps these URLs out of sitemap-0.xml, but Google can still discover them via inbound QR-code links. Added a noindex boolean prop to MainHead.astro; passed it from the four affected templates (linktree, PromoteSocialImage, PromoteAnnounceNewsletter, PromoteNewsletter). Drive-by: PromoteNewsletter had a dead MainHead import and no <head> at all — added a small inline <head> with the noindex meta and dropped the unused import.

  4. 2274501c — Emit image sitemap entries from each page's og:image. New extractOgImage helper reads the rendered HTML for each URL and attaches one image entry per URL via the image: namespace. 458 of 464 URLs now carry an <image:loc>. Per-page selections look right: episode covers for /podcast/episode/..., hosts photo for /, dedicated headers for the directory pages.

Test plan

  • make build green at 472 pages.
  • Sitemap exclude list still keeps /linktree/ and meetup/*/promote/ out of the sitemap (0 leaks).
  • All 7 excluded URLs now ship <meta name="robots" content="noindex, nofollow">; all other pages keep the original index-promoting tag.
  • Spot-checks on lastmod: /podcast/ = newest episode pubDate, /blog/post/<slug>/ = entry pubDate, /impressum/ correctly lacks a lastmod.
  • Spot-checks on og:image: episode covers come through as /_astro/<slug>.<hash>.jpg; homepage uses the hosts photo.
  • Validate emitted XML against the schemas at https://www.sitemaps.org/protocol.html and https://developers.google.com/search/docs/crawling-indexing/sitemaps/image-sitemaps once deployed.
  • Re-submit the sitemap in Google Search Console after merge.
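
The counting checks in this test plan (lastmod coverage, image coverage, zero leaks) could be scripted roughly as follows. This is a minimal sketch, not code from the PR: `auditSitemap` and `EXCLUDED_SUBSTRINGS` are hypothetical names, and the `/promote/` fragment is an assumed shorthand for the meetup/<flavor>/promote/ entry.

```javascript
// Hypothetical helper: summarise a sitemap XML string for the checks above.
// The fragments below are assumptions standing in for the real exclude list.
const EXCLUDED_SUBSTRINGS = ["/linktree/", "/promote/"];

function auditSitemap(xml) {
  // Split into <url>...</url> blocks; sufficient for generated, well-formed output.
  const blocks = xml.match(/<url>[\s\S]*?<\/url>/g) ?? [];
  const report = { total: blocks.length, withLastmod: 0, withImage: 0, leaks: [] };
  for (const block of blocks) {
    // <image:loc> does not match this pattern because it lacks the literal "<loc>".
    const loc = block.match(/<loc>([^<]*)<\/loc>/)?.[1] ?? "";
    if (block.includes("<lastmod>")) report.withLastmod += 1;
    if (block.includes("<image:loc>")) report.withImage += 1;
    if (EXCLUDED_SUBSTRINGS.some((s) => loc.includes(s))) report.leaks.push(loc);
  }
  return report;
}
```

Running it over the contents of sitemap-0.xml and comparing `withLastmod`, `withImage`, and `leaks.length` against the numbers quoted in this PR would turn the spot checks into a repeatable one.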

🤖 Generated with Claude Code

andygrunwald and others added 4 commits May 5, 2026 13:01
Google uses sitemap lastmod as a freshness signal and ignores
changefreq/priority. With ~200 podcast episodes, a daily-syncing tech
podcast index, and an actively curated movie list, every URL today is
just a bare <loc> — search engines have no signal for what changed.

Add a small build-time helper (scripts/sitemap-lastmod.mjs) that walks
the relevant content directories synchronously, parses pubDate / date /
publishedAt / latestEpisodePublished out of frontmatter or JSON, and
exposes a Map<urlPath, ISO> for the sitemap serialize hook to look up.
The map is built once at config load — the per-URL serialize callback
only does Map.get() so the build cost stays negligible.
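
The lookup-only callback described above might look roughly like this. It is a sketch under assumptions, not the code from this commit: `makeSerialize` is a hypothetical name, the map content is illustrative, and the item shape (`url`, `lastmod`) is assumed to match what the sitemap integration passes to its serialize hook.

```javascript
// Illustrative map; in the PR it is produced by scripts/sitemap-lastmod.mjs.
const lastmodByPath = new Map([
  ["/podcast/", "2026-05-05T00:00:00.000Z"], // timestamp is a made-up example value
]);

// Returns a serialize callback that performs only a Map lookup per URL.
function makeSerialize(lastmodMap) {
  return function serialize(item) {
    const path = new URL(item.url).pathname;
    const lastmod = lastmodMap.get(path);
    // Leave the item untouched when no date is known (for example /impressum/).
    return lastmod ? { ...item, lastmod } : item;
  };
}

const serialize = makeSerialize(lastmodByPath);
```

Building the map outside the callback is what keeps the per-URL cost constant: the callback never touches the file system.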

Coverage:
  - Podcast episodes:    pubDate
  - Blog posts:          pubDate
  - Meetup events:       date
  - /podcast/, /blog/, /meetup/<flavor>/, /deutsche-tech-podcasts/,
    /filme-fuer-softwareentwickler/, /spiele-fuer-softwareentwickler/,
    /:                   max of the underlying entries
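
The coverage rules above can be sketched as a pure function. This is not the helper from the commit: the entry shape (`urlPath`, `indexPaths`, `data`), the function names, and the order in which the date fields are tried are all assumptions made for illustration, and file reading is left out on purpose.

```javascript
// Field precedence is an assumption; the commit only lists the field names.
const DATE_FIELDS = ["pubDate", "date", "publishedAt", "latestEpisodePublished"];

// Returns the first parseable date field as an ISO string, or undefined.
function pickIsoDate(data) {
  for (const field of DATE_FIELDS) {
    const value = data[field];
    if (value === undefined || value === null) continue;
    const parsed = new Date(value);
    if (!Number.isNaN(parsed.getTime())) return parsed.toISOString();
  }
  return undefined;
}

// entries: [{ urlPath, indexPaths, data }] (hypothetical shape).
// Each entry gets its own date; every index path gets the max of its entries.
function buildLastmodMap(entries) {
  const map = new Map();
  for (const { urlPath, indexPaths = [], data } of entries) {
    const iso = pickIsoDate(data);
    if (!iso) continue; // no usable date: the URL stays without lastmod
    map.set(urlPath, iso);
    for (const indexPath of indexPaths) {
      const current = map.get(indexPath);
      // Strings produced by toISOString() sort chronologically as plain strings.
      if (!current || iso > current) map.set(indexPath, iso);
    }
  }
  return map;
}
```

An episode entry would list `/podcast/` and `/` in `indexPaths`, so both index pages end up tracking the newest episode date.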

Out of scope (v1, defaults to build time): tag pages, per-genre game
pages, per-category and per-type movie pages. They refresh on every
build anyway, so a missing lastmod costs only a freshness signal, not
correctness.

277 of 464 URLs now carry <lastmod>. Verified spot checks:
  /podcast/                   2026-05-05  (newest episode)
  /podcast/episode/00-...     2022-02-08  (entry pubDate)
  /deutsche-tech-podcasts/    2026-05-04  (newest tech-podcast episode)
  /                           2026-05-05  (homepage tracks newest episode)
  /impressum/                 <no lastmod, correct>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The exclude array carried no rationale, so a future maintainer adding a
similar page would not know why /linktree/ and meetup .../promote/ are
kept out of the sitemap. Add a comment block explaining each entry plus
the substring-match semantics, and rename `exludeFromSitemap` to
`excludeFromSitemap` while we are touching the file.

No behaviour change — sitemap still emits 464 URLs and none of the
intentionally excluded pages leak in.
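
The substring-match semantics mentioned above can be illustrated with a small filter predicate. The array contents here are assumptions (the real entries are described only as meetup/<flavor>/promote/ and linktree/), and `keepInSitemap` is a hypothetical name for the function handed to the sitemap integration's filter option.

```javascript
// Illustrative entries; the actual strings in the config may differ.
const excludeFromSitemap = ["/promote/", "/linktree/"];

// Substring match: a page is kept only if its URL contains none of the entries.
function keepInSitemap(pageUrl) {
  return !excludeFromSitemap.some((fragment) => pageUrl.includes(fragment));
}
```

Because the match is by substring, any future URL that happens to contain one of these fragments is excluded as well, which is worth keeping in mind when adding entries.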

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The sitemap exclude list keeps /linktree/ and meetup */promote/* out of
sitemap-0.xml, but Google can still find them through inbound links —
the QR codes we hand out at meetups are exactly that kind of link. The
robots <meta> in MainHead actively asserted "follow, index" on every
page, including these, which contradicted the intent of the exclude
list.

Add a `noindex` boolean prop to MainHead. When true, the component
emits <meta name="robots" content="noindex, nofollow"> instead of the
default index-promoting tag. Pass `noindex` from:
  - src/pages/linktree.astro
  - src/components/meetup/PromoteSocialImage.astro
  - src/components/meetup/PromoteAnnounceNewsletter.astro
  - src/components/meetup/PromoteNewsletter.astro (drive-by: this
    component imported MainHead but never used it; its template had no
    <head> at all, so the page was emitting raw HTML with no robots
    signal. Add a small <head> with the noindex meta directly and drop
    the dead import.)

Verified all 7 excluded URLs now ship a noindex meta; all other pages
keep the original index-promoting tag.
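
The prop's effect can be sketched as a plain function. This is an illustration rather than the MainHead.astro source: `robotsMeta` is a hypothetical name, and the default content value is assumed from the "follow, index" wording above (the real default tag may carry additional directives).

```javascript
// Assumed default; taken from the "follow, index" wording in this commit message.
const DEFAULT_ROBOTS = "follow, index";

// Returns the robots meta tag for a page, switching on the noindex flag.
function robotsMeta({ noindex = false } = {}) {
  const content = noindex ? "noindex, nofollow" : DEFAULT_ROBOTS;
  return `<meta name="robots" content="${content}">`;
}
```

Defaulting `noindex` to false means existing templates keep their current output without any change at the call site.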

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add an `extractOgImage` helper that reads the rendered HTML for a URL
and returns its og:image meta as an absolute URL. The sitemap serialize
hook now attaches one image entry per URL via the `image:` namespace,
giving Google Image search a representative thumbnail per page without
bloating the sitemap with decorative assets like favicons, brand SVGs,
or background patterns.
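
Attaching one image per URL could look roughly like this. It is a sketch under assumptions: `withImageEntry` and `lookupImage` are hypothetical names, and the `img` field follows the item shape of the `sitemap` npm package, which may not match exactly what this commit passes through the serialize hook.

```javascript
// lookupImage is a hypothetical function: (pageUrl) => absolute image URL or undefined.
function withImageEntry(item, lookupImage) {
  const imageUrl = lookupImage(item.url);
  // Pages without an og:image keep their entry unchanged.
  if (!imageUrl) return item;
  // Assumed field name: `img` as an array of { url } objects.
  return { ...item, img: [{ url: imageUrl }] };
}
```

Limiting the array to a single element is what keeps decorative assets out: only the page's chosen og:image is ever emitted.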

The pick is good per page type:
  - episode pages   episode cover (Astro-processed _astro/<slug>.<hash>.jpg)
  - homepage        the hosts photo
  - directory pages the German Tech Podcasts / games / movies header

458 of 464 URLs now carry an <image:image><image:loc>...</image:loc>
entry. The 6 without are pages that don't pass `image=` to MainHead;
those just omit the image entry and stay intact otherwise.

The helper runs at sitemap-serialise time, after Astro has written
every page to dist/, so it can read the actual rendered meta. A tiny
regex parser is used because we only need one specific meta tag, which
does not justify a full DOM parse.
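
A regex-based extraction of this kind might look like the following. It is a minimal sketch and not the helper from this commit: `extractOgImageSketch` is a hypothetical name, it takes the HTML string and site origin as arguments instead of reading from dist/, and it does not decode HTML entities in the attribute value.

```javascript
// Sketch only: operates on an HTML string rather than a file in dist/.
function extractOgImageSketch(html, siteOrigin) {
  // First <meta> tag whose property attribute is exactly og:image
  // (og:image:width and similar do not match because of the closing quote).
  const tag = html.match(/<meta\b[^>]*\bproperty=["']og:image["'][^>]*>/i)?.[0];
  if (!tag) return undefined;
  // The content attribute may appear before or after property, so search the whole tag.
  const content = tag.match(/\bcontent=["']([^"']*)["']/i)?.[1];
  if (!content) return undefined;
  // Resolve relative paths such as /_astro/<slug>.<hash>.jpg against the site origin.
  return new URL(content, siteOrigin).href;
}
```

Returning `undefined` for pages without the tag lets the caller simply skip the image entry, which matches the behaviour described for the 6 URLs without an image.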

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@netlify

netlify Bot commented May 5, 2026

Deploy Preview for nifty-bardeen-5c7e53 ready!

| Name | Link |
| --- | --- |
| 🔨 Latest commit | 2274501 |
| 🔍 Latest deploy log | https://app.netlify.com/projects/nifty-bardeen-5c7e53/deploys/69f9cf34e474b10008d93e44 |
| 😎 Deploy Preview | https://deploy-preview-1427--nifty-bardeen-5c7e53.netlify.app |

@andygrunwald andygrunwald merged commit eb91bea into main May 5, 2026
6 checks passed
@andygrunwald andygrunwald deleted the andygrunwald/sitemap-improvements branch May 5, 2026 11:14