Enrich drift Slack alert and harden it against transient API blips #252

Merged
jpr5 merged 2 commits into main from feat/drift-slack-detail
Jun 8, 2026
jpr5 (Contributor) commented Jun 8, 2026

Summary

Two improvements to the scheduled drift-detection workflow (.github/workflows/test-drift.yml), both on the same file and theme:

  1. Retry-before-alert (transient-blip hardening). A single critical run from the collector can be a transient real-API hiccup (a streaming call failing mid-flight, with no terminal event) rather than a genuine format change, and alerting on it pages the team with a false alarm. The drift job now runs the collector behind a wrapper that re-runs it on critical drift and alerts only if the drift persists across every attempt.
  2. Per-provider Slack detail (original enrichment). The alert now carries a short, scannable summary of which providers drifted and what changed, instead of forcing a click-through.

1. Retry-before-alert

The problem. On 2026-06-08 the Drift Tests workflow posted a 🚨 false alarm: a single OpenAI Responses /v1/responses streaming call failed mid-flight (emitted error + response.failed, no content events, no terminal response.completed). To the collector, that looks identical to a critical diff, but it was OpenAI's API hiccuping, not a format change. Evidence that it was transient: the separate Fix Drift workflow re-ran the SAME collector ~1 minute later and got Critical diffs: 0. Eleven straight green days had preceded it.

The fix. New scripts/drift-retry.ts wraps scripts/drift-report-collector.ts with a "retry before alert" policy:

  • Collector exit 0 → no critical drift → SUCCESS, no further runs (the common green path stays fast — zero extra real-API calls).
  • Collector exit 2 → critical drift → retry, for up to 3 attempts in total, with a ~45s backoff between attempts. As soon as ANY attempt reports 0 critical diffs → transient → SUCCESS, no alert.
  • Only when every attempt shows critical drift → propagate exit 2 → the drift job fails → the notify job alerts.
  • Collector exit 1 (or any other non-0/2 code) → script/infra crash, not drift → propagate immediately, no retry (retrying won't help; it's a real break worth surfacing).
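
The policy above can be sketched as one function with injected dependencies. This is a minimal illustration under stated assumptions, not the code in scripts/drift-retry.ts: the RetryDeps and RetryResult shapes, the parameter names, and the defaults are hypothetical. Only the name retryUntilStable, the 0 / 2 / other exit-code contract, the 3-attempt and ~45s figures, and the criticalRuns count come from this description.

```typescript
// Hypothetical dependency bundle: everything with side effects is passed in.
type RetryDeps = {
  runCollector: () => number; // returns the collector's exit code
  sleep: (ms: number) => void; // backoff between attempts
  log: (message: string) => void;
  maxAttempts?: number; // assumed default: 3
  backoffMs?: number; // assumed default: 45000
};

// Hypothetical result shape.
type RetryResult = {
  exitCode: number; // 0 = clean or transient, 2 = persistent drift, other = crash
  criticalRuns: number; // how many attempts reported critical drift
  attempts: number[]; // per-attempt exit codes, in order
};

function retryUntilStable(deps: RetryDeps): RetryResult {
  const maxAttempts = deps.maxAttempts ?? 3;
  const backoffMs = deps.backoffMs ?? 45000;
  const attempts: number[] = [];
  let criticalRuns = 0;

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    // Back off only between attempts: never before the first or after the last.
    if (attempt > 1) deps.sleep(backoffMs);

    const code = deps.runCollector();
    attempts.push(code);

    if (code === 0) {
      // Any clean run means earlier critical runs were transient.
      deps.log(`attempt ${attempt}/${maxAttempts}: no critical drift`);
      return { exitCode: 0, criticalRuns, attempts };
    }
    if (code !== 2) {
      // Script or infra crash: retrying will not help, so surface it at once.
      deps.log(`attempt ${attempt}/${maxAttempts}: collector failed with exit ${code}`);
      return { exitCode: code, criticalRuns, attempts };
    }

    criticalRuns++;
    deps.log(`attempt ${attempt}/${maxAttempts}: critical drift`);
  }

  // Every attempt reported critical drift: treat it as persistent.
  return { exitCode: 2, criticalRuns, attempts };
}
```

Because the runner, the sleep, and the logger are plain function arguments, a test can feed a scripted sequence of exit codes (for example 2 then 0) and a recording sleep, and check the result without spawning a process or waiting.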

The retry decision lives in a pure, dependency-injected function (retryUntilStable) — collector-runner, sleep, and logger are all injected — so it is unit-tested without spawning subprocesses or sleeping. main() wires the real collector subprocess (via spawnSync, stdio inherited) and a synchronous Atomics.wait backoff, and emits a drift_runs GITHUB_OUTPUT marker recording how many runs confirmed the drift. The drift job exposes that as a job output so the alert can note "(confirmed across N runs)" when it does fire. The wrapper preserves the collector's exit-code contract (0 / 2 / other), so the surrounding YAML is unchanged.
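
The two side-effecting pieces that main() wires in, a synchronous backoff and the output marker, might look roughly like the sketch below. The helper names sleepSync, formatOutput, and writeOutput are hypothetical. The description confirms only that the backoff uses Atomics.wait and that a drift_runs value is written to GITHUB_OUTPUT; the single-line name=value form is the standard GitHub Actions output-file format.

```typescript
import { appendFileSync } from "node:fs";

// Block the current thread for `ms` milliseconds without busy-waiting.
// The cell holds 0 and nothing ever notifies it, so the wait simply times out.
function sleepSync(ms: number): void {
  const cell = new Int32Array(new SharedArrayBuffer(4));
  Atomics.wait(cell, 0, 0, ms);
}

// Format one single-line step output as `name=value` plus a trailing newline.
function formatOutput(name: string, value: string | number): string {
  return `${name}=${value}\n`;
}

// Append the output to the file GitHub Actions names in $GITHUB_OUTPUT.
// Returns false when no path is available (for example, a local run).
function writeOutput(
  name: string,
  value: string | number,
  outputPath: string | undefined,
): boolean {
  if (!outputPath) return false;
  appendFileSync(outputPath, formatOutput(name, value));
  return true;
}
```

In a real entry point the call would be something like writeOutput("drift_runs", criticalRuns, process.env.GITHUB_OUTPUT). A blocking sleep is acceptable here because the wrapper is a short-lived CLI script with nothing else to do between attempts.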

Fix Drift interaction. Fix Drift triggers on workflow_run when Drift Tests concludes failure. Because transient blips no longer fail Drift Tests, Fix Drift no longer runs needlessly on a one-off hiccup — it only runs after persistent drift has already been confirmed. The interaction is preserved, not broken or duplicated: Fix Drift still runs the collector directly (it's gated on exit_code == 2 for the actual fix), which is correct because by the time it runs the drift was already confirmed persistent.

2. Per-provider Slack detail (enrichment)

  • scripts/drift-slack-summary.ts reads the drift-report.json the collector produces, groups diffs by provider, and emits a compact Slack mrkdwn summary: one bullet per provider with a severity tally and a few representative changed field paths (capped for readability; full detail stays in the uploaded drift-report artifact + the View run link). Degrades gracefully to a generic message if the report is missing/malformed.
  • The drift job exposes the summary as a job output (outputs.summary); the notify job inserts it as a detail block in all three drift headlines. The persistent-drift alert carries this enriched detail plus the "confirmed across N runs" note.
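
The grouping and capping described above could be sketched as follows. The report schema used here (an entries array whose items carry a provider name and a list of diffs with path and severity) is an assumption made for illustration; the real drift-report.json layout is not shown in this description. The fallback wording and the cap of 3 paths are likewise hypothetical.

```typescript
// Assumed report shape, for illustration only.
type Severity = "critical" | "warning" | "info";
interface Diff { path: string; severity: Severity }
interface Entry { provider: string; diffs: Diff[] }
interface DriftReport { entries: Entry[] }

const SEVERITIES: Severity[] = ["critical", "warning", "info"];
const GENERIC = "Drift detected; see the run for details."; // hypothetical fallback text
const DASH = "\u2014"; // separator used in the example message

function summarizeDriftReport(raw: string, maxPaths = 3): string {
  let report: unknown;
  try {
    report = JSON.parse(raw);
  } catch {
    return GENERIC; // malformed JSON degrades to the generic message
  }
  const entries = (report as Partial<DriftReport> | null)?.entries;
  if (!Array.isArray(entries)) return GENERIC;

  // Merge multiple entries for the same provider; Map keeps first-seen order.
  const byProvider = new Map<string, Diff[]>();
  for (const entry of entries) {
    if (typeof entry?.provider !== "string" || !Array.isArray(entry.diffs)) continue;
    const bucket = byProvider.get(entry.provider) ?? [];
    bucket.push(...entry.diffs);
    byProvider.set(entry.provider, bucket);
  }

  const lines: string[] = [];
  byProvider.forEach((diffs, provider) => {
    if (diffs.length === 0) return;
    // Severity tally, most severe first, omitting zero counts.
    const tally = SEVERITIES
      .map((s) => ({ s, n: diffs.filter((d) => d.severity === s).length }))
      .filter(({ n }) => n > 0)
      .map(({ s, n }) => `${n} ${s}`)
      .join(", ");
    // Show only the first few paths; note how many were left out.
    const paths = diffs.slice(0, maxPaths).map((d) => `\`${d.path}\``).join(", ");
    const more = diffs.length > maxPaths ? `, +${diffs.length - maxPaths} more` : "";
    lines.push(`• *${provider}* ${DASH} ${tally}: ${paths}${more}`);
  });

  // Join with a real LF so the downstream JSON encoder emits a proper \n escape.
  return lines.length > 0 ? lines.join("\n") : GENERIC;
}
```

Keeping the summarizer a pure string-to-string function makes the provider ordering, the path cap, and the real-newline behavior easy to assert in unit tests.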

Example enriched + confirmed message

🚨 *HTTP API drift detected* in aimock — providers changed response formats.
• *OpenAI Chat* — 1 critical, 1 warning: `choices[0].message.refusal`, `choices[0].message.role`
• *Anthropic* — 1 critical: `content[0].thinking`
_(confirmed across 3 runs)_
<https://github.com/CopilotKit/aimock/actions/runs/123456|View run>

Newline correctness (the GHA gotcha)

The Slack payload is built in bash and encoded with jq -n --arg, using real bash LF ($'\n') which jq encodes as proper JSON \n. There is no literal \n inside any format()/toJSON() expression (the GitHub Actions trap that renders a visible backslash-n).
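
The workflow does this in bash and jq, but the same distinction can be illustrated in TypeScript, with JSON.stringify standing in for jq's --arg encoding. This is an analogy only, not part of the workflow; the function and variable names are hypothetical.

```typescript
// Encode a Slack-style payload. A real LF in the input becomes the
// two-character JSON escape \n, which Slack renders as a line break.
function buildPayload(text: string): string {
  return JSON.stringify({ text });
}

// Correct: the lines are joined with a real newline character.
const goodPayload = buildPayload(["headline", "detail"].join("\n"));
// goodPayload is {"text":"headline\ndetail"}

// The trap: the "newline" is already a literal backslash followed by n,
// so the encoder escapes the backslash and Slack shows a visible \n.
const badPayload = buildPayload(["headline", "detail"].join("\\n"));
// badPayload is {"text":"headline\\ndetail"}
```

Decoding each payload makes the difference concrete: the first yields two lines of text, the second yields one line containing a backslash and the letter n.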

Test plan

  • Red-green unit tests for retryUntilStable in src/__tests__/drift-retry.test.ts (8 tests): single clean run = no retry; critical→clean = transient success; all-critical = exit 2 with criticalRuns count; collector crash (exit 1) and unexpected codes propagate immediately without retry; backoff fires only between attempts (not before first / after last); per-attempt exit codes recorded in order. Verified the suite goes red when the clean-exit early-return is mutated out.
  • Red-green unit tests for summarizeDriftReport in src/__tests__/drift-scripts.test.ts (provider naming, severity tally, multi-entry merge, multi-provider ordering, path capping, real-newline assertion).
  • pnpm test — full suite green (3367 tests, 94 files).
  • pnpm build green; npx tsc --noEmit clean.
  • prettier --check and eslint clean on changed files.
  • actionlint introduces no new findings (one pre-existing info-level SC2086 on the untouched prev step remains; the step I edited is now fully quoted); zizmor --min-severity medium clean.
  • Verified retryUntilStable end-to-end via tsx: transient → exit 0, persistent → exit 2 with criticalRuns=3, real Atomics.wait backoff timing.

Notes / limitations

  • CI-workflow + tooling only: no version bump, no CHANGELOG entry, no release.
  • Retries hit real provider APIs, so maxAttempts is kept small (3). A persistent format change costs at most 3 collector runs before alerting (two ~45s backoffs, so roughly 1.5 min of extra waiting); transient blips clear on the 2nd run.
  • The "confirmed across N runs" note appears only on HTTP API drift headlines (the path that goes through the retry wrapper). AG-UI-only schema drift is deterministic (no real-API call) and does not carry the note.

jpr5 added 2 commits June 8, 2026 09:18
…changed

The #oss-alerts drift notification previously said only "providers changed
response formats" with a View run link, forcing a click-through to learn
anything. Distill drift-report.json into a short, scannable Slack mrkdwn
summary (per-provider severity tally + a few example changed field paths,
capped to stay readable) and thread it into the alert.

The summary is computed in the `drift` job (which has the report file) and
surfaced as a job output that the separate `notify` job consumes. The
View run link and the uploaded artifact still carry full detail.

Newlines in the Slack payload use real bash LF that jq's --arg encodes as
proper JSON \n escapes — no literal "\n" in any format()/toJSON expression
(the GitHub Actions gotcha that renders a visible backslash-n in Slack).
…blips

A single critical run from drift-report-collector.ts can be a transient
real-API hiccup (a streaming call failing mid-flight with no terminal
event) rather than a genuine format change. Alerting on it pages the team
with a false alarm.

Add scripts/drift-retry.ts, a "retry before alert" wrapper around the
collector: on critical drift (exit 2) it re-runs the collector up to 3
total attempts with a ~45s backoff, and declares success the moment any
attempt returns 0 critical (transient). It only propagates exit 2 — which
fails the Drift Tests job and triggers the Slack alert — when drift
PERSISTS across every attempt. A clean first run takes the fast path with
no extra real-API calls; collector crashes (exit 1) propagate immediately
without retry. The persistent-drift count is surfaced so the alert can note
"confirmed across N runs".

The retry decision is a pure, dependency-injected function
(retryUntilStable) with red-green unit tests in drift-retry.test.ts.
The Drift Tests workflow now invokes the wrapper instead of the collector
directly. Because transient blips no longer fail Drift Tests, the Fix Drift
workflow (triggered on Drift Tests failure) no longer runs needlessly on a
one-off API hiccup.
jpr5 changed the title from "Enrich drift Slack alert with which providers drifted and what changed" to "Enrich drift Slack alert and harden it against transient API blips" on Jun 8, 2026
pkg-pr-new Bot commented Jun 8, 2026

npm i https://pkg.pr.new/@copilotkit/aimock@252

commit: 319e6f9

jpr5 merged commit 92cdf1a into main on Jun 8, 2026
25 checks passed
jpr5 deleted the feat/drift-slack-detail branch on June 8, 2026 16:35
