Enrich drift Slack alert and harden it against transient API blips#252
Merged
Conversation
…changed The #oss-alerts drift notification previously said only "providers changed response formats" with a View run link, forcing a click-through to learn anything. Distill drift-report.json into a short, scannable Slack mrkdwn summary (per-provider severity tally + a few example changed field paths, capped to stay readable) and thread it into the alert. The summary is computed in the `drift` job (which has the report file) and surfaced as a job output that the separate `notify` job consumes. The View run link and the uploaded artifact still carry full detail. Newlines in the Slack payload use real bash LF that jq's --arg encodes as proper JSON \n escapes — no literal "\n" in any format()/toJSON expression (the GitHub Actions gotcha that renders a visible backslash-n in Slack).
…blips A single critical run from drift-report-collector.ts can be a transient real-API hiccup (a streaming call failing mid-flight with no terminal event) rather than a genuine format change. Alerting on it pages the team with a false alarm. Add scripts/drift-retry.ts, a "retry before alert" wrapper around the collector: on critical drift (exit 2) it re-runs the collector up to 3 total attempts with a ~45s backoff, and declares success the moment any attempt returns 0 critical (transient). It only propagates exit 2 — which fails the Drift Tests job and triggers the Slack alert — when drift PERSISTS across every attempt. A clean first run takes the fast path with no extra real-API calls; collector crashes (exit 1) propagate immediately without retry. The persistent-drift count is surfaced so the alert can note "confirmed across N runs". The retry decision is a pure, dependency-injected function (retryUntilStable) with red-green unit tests in drift-retry.test.ts. The Drift Tests workflow now invokes the wrapper instead of the collector directly. Because transient blips no longer fail Drift Tests, the Fix Drift workflow (triggered on Drift Tests failure) no longer runs needlessly on a one-off API hiccup.
commit: |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Two improvements to the scheduled drift-detection workflow (
.github/workflows/test-drift.yml), both on the same file and theme:driftjob now runs the collector behind a wrapper that re-runs on critical drift and only alerts if the drift persists across every attempt.1. Retry-before-alert
The problem. On 2026-06-08 the
Drift Testsworkflow posted a 🚨 false alarm: a single OpenAI Responses/v1/responsesstreaming call failed mid-flight (emittederror+response.failed, no content events, no terminalresponse.completed). That looks identical to a critical diff to the collector, but it was OpenAI's API hiccuping, not a format change. Proof it was transient: the separateFix Driftworkflow re-ran the SAME collector ~1 minute later and gotCritical diffs: 0. 11 straight green days preceded it.The fix. New
scripts/drift-retry.tswrapsscripts/drift-report-collector.tswith a "retry before alert" policy:driftjob fails → thenotifyjob alerts.The retry decision lives in a pure, dependency-injected function (
retryUntilStable) — collector-runner, sleep, and logger are all injected — so it is unit-tested without spawning subprocesses or sleeping.main()wires the real collector subprocess (viaspawnSync, stdio inherited) and a synchronousAtomics.waitbackoff, and emits adrift_runsGITHUB_OUTPUT marker recording how many runs confirmed the drift. Thedriftjob exposes that as a job output so the alert can note "(confirmed across N runs)" when it does fire. The wrapper preserves the collector's exit-code contract (0 / 2 / other), so the surrounding YAML is unchanged.Fix Drift interaction.
Fix Drifttriggers onworkflow_runwhenDrift Testsconcludesfailure. Because transient blips no longer failDrift Tests,Fix Driftno longer runs needlessly on a one-off hiccup — it only runs after persistent drift has already been confirmed. The interaction is preserved, not broken or duplicated:Fix Driftstill runs the collector directly (it's gated onexit_code == 2for the actual fix), which is correct because by the time it runs the drift was already confirmed persistent.2. Per-provider Slack detail (enrichment)
scripts/drift-slack-summary.tsreads thedrift-report.jsonthe collector produces, groups diffs by provider, and emits a compact Slack mrkdwn summary: one bullet per provider with a severity tally and a few representative changed field paths (capped for readability; full detail stays in the uploadeddrift-reportartifact + the View run link). Degrades gracefully to a generic message if the report is missing/malformed.driftjob exposes the summary as a job output (outputs.summary); thenotifyjob inserts it as a detail block in all three drift headlines. The persistent-drift alert carries this enriched detail plus the "confirmed across N runs" note.Example enriched + confirmed message
Newline correctness (the GHA gotcha)
The Slack payload is built in bash and encoded with
jq -n --arg, using real bash LF ($'\n') which jq encodes as proper JSON\n. There is no literal\ninside anyformat()/toJSON()expression (the GitHub Actions trap that renders a visible backslash-n).Test plan
retryUntilStableinsrc/__tests__/drift-retry.test.ts(8 tests): single clean run = no retry; critical→clean = transient success; all-critical = exit 2 withcriticalRunscount; collector crash (exit 1) and unexpected codes propagate immediately without retry; backoff fires only between attempts (not before first / after last); per-attempt exit codes recorded in order. Verified the suite goes red when the clean-exit early-return is mutated out.summarizeDriftReportinsrc/__tests__/drift-scripts.test.ts(provider naming, severity tally, multi-entry merge, multi-provider ordering, path capping, real-newline assertion).pnpm test— full suite green (3367 tests, 94 files).pnpm buildgreen;npx tsc --noEmitclean.prettier --checkandeslintclean on changed files.actionlintintroduces no new findings (one pre-existing info-level SC2086 on the untouchedprevstep remains; the step I edited is now fully quoted);zizmor --min-severity mediumclean.retryUntilStableend-to-end via tsx: transient → exit 0, persistent → exit 2 withcriticalRuns=3, realAtomics.waitbackoff timing.Notes / limitations
maxAttemptsis kept small (3). A persistent format change costs at most 3 collector runs before alerting (~1.5 min extra); transient blips clear on the 2nd run.