Run a dedicated regression workflow on every pull request merge so the project can detect:
- produce and consume performance regressions
- functional test regressions
- CPU and memory instability, including memory leaks
- broken backpressure on producer and consumer paths
- CI already starts Kafka in GitHub Actions with `docker compose up --build --force-recreate -d --wait` in `.github/workflows/ci.yml`.
- The repo already contains producer and consumer benchmark scripts in `benchmarks/producer-single.ts`, `benchmarks/producer-batch.ts`, and `benchmarks/consumer.ts`.
- The repo already contains targeted backpressure tests:
  - `test/clients/consumer/messages-stream-backpressure.test.ts`
  - `test/clients/consumer/messages-stream-backpressure-memory.test.ts`
  - `test/memory/messages-stream-backpressure.memory-test.ts`
- There is already reconnect coverage in `test/clients/consumer/messages-stream.test.ts`.
- The repo already has a 3-broker cluster in `docker-compose.yml`, which is important for realistic rebalance and memory tests.
Issue history and current tests show that the plan should include more than the four initial categories.
Add explicit regression coverage for:
- rebalance races causing stale offsets or duplicate consumption
  - closed issue #223
- stale epoch and leader epoch handling during startup and fetch
  - open issue #267
  - closed issue #248
- metadata refresh resilience when bootstrap brokers fail but discovered brokers are healthy
  - closed issue #232
- automatic recovery after transient network failures or stream disconnects
  - existing test coverage in `messages-stream.test.ts`
  - related closed issue #206
- SASL reauthentication deadlocks
  - closed issue #226
- batch consumption stalls
  - closed issue #228
- deserialization failures under heavy load
  - closed issue #227
- consumer backpressure regressions causing unbounded buffering, stalls, or memory growth
  - closed issue #260
- `minBytes` and fetch tuning regressions that create excessive bandwidth or CPU load
  - closed issue #99
- per-partition backpressure fairness
  - open issue #128
- offset commit correctness across manual commit, timed autocommit, and close paths
- message loss or duplicate delivery during failure or recovery scenarios
- idempotent and transactional producer reliability
- shutdown and cleanup stability, including leaked timers or pending handles
- authentication-path reliability across SASL/OAUTHBEARER, SCRAM, and GSSAPI
- schema-registry and deserializer integration under load
- cross-version compatibility on oldest and newest supported Kafka versions
- control-batch and tombstone correctness
Create a separate workflow, for example `.github/workflows/regression.yml`.
Trigger it on:
- `push` to `main`
- `workflow_dispatch`
Run it only:
- when code is merged to `main`
- when manually triggered via GitHub Actions
Scope of each run:
- execute correctness, backpressure, memory, performance, compatibility, and issue-driven regression coverage as configured for this workflow
- compare results against the stored baseline for the same lane
Run this workflow on a custom self-hosted GitHub Actions runner hosted on a dedicated Linux server.
Reason:
- performance and resource measurements are too noisy on shared GitHub-hosted runners
- a dedicated Linux host gives more stable CPU, memory, I/O, network, and Docker behavior
- the regression baseline is only meaningful if the execution environment is kept stable over time
Recommended order:
- provision a dedicated Linux server and register it as a self-hosted GitHub Actions runner
- run the regression workflow only on that runner
- keep Docker, Node.js, Kafka image versions, host sizing, and kernel settings stable over time
- pin Node.js and Kafka versions for regression comparisons
Cluster strategy:
- correctness smoke jobs can use the existing single-broker services where suitable
- rebalance, leak, and backpressure soak jobs should use the existing 3-broker cluster
- SASL reauthentication jobs should use `broker-sasl`
Version strategy:
- keep the main performance comparison on a single pinned Kafka and Node.js version
- add at least one oldest-supported Kafka lane and one newest-supported Kafka lane for correctness and protocol regressions
Runner requirement:
- label the runner specifically for regression jobs, for example `self-hosted`, `linux`, `x64`, `kafka-regression`
- avoid sharing that runner with unrelated CI workloads
- keep background services on the host to a minimum
- allow only one workflow instance at a time on that dedicated runner
- queue later runs until the current run completes
- configure GitHub Actions concurrency so runs are serialized instead of cancelled
Purpose:
- fail immediately if the functional suite regresses
Actions:
- install dependencies
- start Kafka with Docker Compose
- run `pnpm run ci`
- upload test reports and logs as artifacts
Additional assertions to make explicit in this job family:
- no message gaps in recovery scenarios
- no duplicate delivery in recovery scenarios
- committed offsets advance to the expected next offset
- close and teardown complete without hung handles
Notes:
- this job largely exists already in `.github/workflows/ci.yml`
- keep it as the hard gate for merges
Purpose:
- detect stalled pipelines, unbounded buffering, and producer-side unwritable-node regressions
Actions:
- run the current targeted consumer backpressure tests
- add a dedicated producer backpressure regression suite around `acks=0`, stream writes, and unwritable nodes
- add a per-partition fairness scenario so one hot partition cannot starve others
- publish metrics artifact with:
  - max `readableLength`
  - processed messages per second while backpressured
  - stall count / no-progress intervals
  - partition-level fairness during mixed-load consumption
Pass criteria:
- no stalls
- no unbounded growth in buffered messages
- no partition starvation beyond agreed thresholds
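The stall and fairness criteria above could be evaluated from the published metrics roughly as follows. This is a minimal sketch, not part of the existing test suite: `ProgressSample`, `countStalls`, and `fairnessRatio` are hypothetical names, and the sampling interval and thresholds are assumptions that would still need to be agreed.

```typescript
// Hypothetical helpers: none of these names exist in the repository today.

// One observation of cumulative progress, taken at a fixed sampling interval.
interface ProgressSample {
  tMs: number;       // sample time in milliseconds
  processed: number; // cumulative messages processed so far
}

// Counts distinct no-progress intervals that last at least maxNoProgressMs.
// A stall is counted once, then re-armed after the next sample that shows progress.
function countStalls(samples: ProgressSample[], maxNoProgressMs: number): number {
  if (samples.length === 0) return 0;
  let stalls = 0;
  let lastProgressAt = samples[0].tMs;
  let inStall = false;
  for (let i = 1; i < samples.length; i++) {
    if (samples[i].processed > samples[i - 1].processed) {
      lastProgressAt = samples[i].tMs;
      inStall = false;
    } else if (!inStall && samples[i].tMs - lastProgressAt >= maxNoProgressMs) {
      stalls++;
      inStall = true;
    }
  }
  return stalls;
}

// Ratio of the slowest partition to the fastest one, in [0, 1].
// 1 means perfectly even consumption; values near 0 indicate starvation.
function fairnessRatio(perPartitionCounts: number[]): number {
  if (perPartitionCounts.length === 0) return 1;
  const max = Math.max(...perPartitionCounts);
  const min = Math.min(...perPartitionCounts);
  return max === 0 ? 1 : min / max;
}
```

A job could then fail when `countStalls(...)` is greater than zero or when `fairnessRatio(...)` drops below the agreed starvation threshold.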
Purpose:
- detect memory leaks and resource drift under sustained pressure
Actions:
- run `npm run test:memory`
- add process-level sampling during test execution:
  - Node heap used
  - RSS
  - CPU percent
  - event loop delay
- persist the time series as artifacts
- compare the last samples with the baseline envelope from `main`
Pass criteria:
- no monotonic leak pattern
- no envelope drift above agreed memory threshold
- CPU remains within an acceptable band for the same workload
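One possible way to turn "no monotonic leak pattern" into a check is to fit a least-squares line through the heap samples and fail when the growth rate exceeds an agreed limit. The sketch below assumes that approach; `HeapSample`, `heapSlopeBytesPerSec`, and `looksLikeLeak` are hypothetical names, and the limit itself is a placeholder rather than a recommended value.

```typescript
// Hypothetical leak heuristic; the plan does not prescribe a specific method.

interface HeapSample {
  tMs: number;           // sample time in milliseconds
  heapUsedBytes: number; // heap used at that time
}

// Least-squares slope of heap usage over time, in bytes per second.
// Returns 0 when there are fewer than two samples or no time spread.
function heapSlopeBytesPerSec(samples: HeapSample[]): number {
  const n = samples.length;
  if (n < 2) return 0;
  let sumT = 0;
  let sumH = 0;
  for (const s of samples) {
    sumT += s.tMs;
    sumH += s.heapUsedBytes;
  }
  const meanT = sumT / n;
  const meanH = sumH / n;
  let num = 0;
  let den = 0;
  for (const s of samples) {
    const dt = s.tMs - meanT;
    num += dt * (s.heapUsedBytes - meanH);
    den += dt * dt;
  }
  // Slope is in bytes per millisecond; convert to bytes per second.
  return den === 0 ? 0 : (num / den) * 1000;
}

// Flags a run whose sustained heap growth exceeds the agreed limit.
function looksLikeLeak(samples: HeapSample[], maxBytesPerSec: number): boolean {
  return heapSlopeBytesPerSec(samples) > maxBytesPerSec;
}
```

A slope check tolerates the sawtooth pattern that garbage collection produces, which a naive "last sample greater than first sample" comparison would not. Dropping warmup samples before fitting is likely to reduce false positives.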
Implementation note:
- because memory tests are currently excluded from CI, this job should run in this dedicated regression workflow on merge to `main` and on manual dispatch, not on every PR update
Purpose:
- detect throughput and latency regressions for produce and consume paths
Actions:
- create a benchmark harness around the existing scripts in `benchmarks/`
- run at least these scenarios:
  - single-message produce
  - batched produce
  - stream/evented consume
- normalize output into machine-readable JSON
- store benchmark results in S3 or a compatible object store and optionally expose a short summary in GitHub Actions
- compare against the latest `main` baseline, not against raw historical averages
Metrics to track:
- messages/sec
- p50, p95, p99 latency
- bytes/sec
- CPU per 100k messages
- RSS / heap delta during benchmark window
Pass criteria:
- hard fail only on severe regression
- warn on moderate degradation
Recommended initial thresholds:
- fail if throughput degrades by more than 15%
- fail if latency worsens by more than 20%
- warn if throughput degrades by more than 8%
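The initial thresholds above could be applied by a small comparison function like the one sketched here. The names `LaneMetrics` and `classifyRegression` are hypothetical, and using p99 as the latency figure is an assumption, since the thresholds do not say which percentile they apply to.

```typescript
// Hypothetical comparison helper; thresholds mirror the initial values listed above.

type Verdict = 'pass' | 'warn' | 'fail';

interface LaneMetrics {
  messagesPerSec: number; // higher is better
  p99LatencyMs: number;   // lower is better (assumed percentile)
}

const THROUGHPUT_FAIL = 0.15; // fail if throughput degrades by more than 15%
const THROUGHPUT_WARN = 0.08; // warn if throughput degrades by more than 8%
const LATENCY_FAIL = 0.2;     // fail if latency worsens by more than 20%

function classifyRegression(baseline: LaneMetrics, current: LaneMetrics): Verdict {
  // Both values are relative changes where a positive number means "worse than baseline".
  const throughputDrop =
    (baseline.messagesPerSec - current.messagesPerSec) / baseline.messagesPerSec;
  const latencyIncrease =
    (current.p99LatencyMs - baseline.p99LatencyMs) / baseline.p99LatencyMs;

  if (throughputDrop > THROUGHPUT_FAIL || latencyIncrease > LATENCY_FAIL) return 'fail';
  if (throughputDrop > THROUGHPUT_WARN) return 'warn';
  return 'pass';
}
```

For example, a drop from 100,000 to 90,000 messages/sec with unchanged latency is a 10% degradation and would yield `'warn'`, while a drop to 80,000 would yield `'fail'`.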
Purpose:
- detect version-sensitive protocol regressions and authentication hangs that may not show up in the main pinned lane
Actions:
- run a reduced correctness suite against the oldest supported Kafka version
- run the same reduced suite against the newest supported Kafka version
- run auth-focused smoke coverage for:
- SASL/OAUTHBEARER
- SCRAM
- GSSAPI when practical in this workflow
- upload broker and client logs for any failed auth or protocol lane
Pass criteria:
- no auth deadlocks
- no protocol-version-specific fetch, commit, or rebalance regressions
- no version-specific startup failures
The current suite is strong, but the following focused tests should be added so known incidents stay covered:
- committed-offset startup with `groupProtocol: 'consumer'` under member-epoch churn
  - covers `STALE_MEMBER_EPOCH` issue #267
- leader epoch refresh during long-running fetch loops
  - covers `FENCED_LEADER_EPOCH` issue #248
- rebalance while offsets are being refreshed
  - covers issue #223
- metadata refresh after bootstrap broker loss
  - covers issue #232
- SASL reauth under async token refresh
  - covers issue #226
- heavy-load deserialization with large batches and mixed payload sizes
  - covers issue #227
- consumer backpressure under `for await` and pipeline-based consumption
  - covers issue #260
- bandwidth amplification regression when `minBytes` is set
  - covers issue #99
- long-running batch consumer stall detection
  - covers issue #228
- offset commit correctness under manual commit, timed autocommit, and stream close
- message delivery invariants under reconnect, rebalance, stale epoch, and backpressure
- idempotent producer retry behavior and transactional correctness
- shutdown under load with no leaked timers or pending handles
- schema-registry consumption and deserialization under sustained load
- auth-path regression coverage for OAUTHBEARER, SCRAM, and GSSAPI
- tombstone and control-batch correctness
Every recovery or stress regression should assert the same core correctness properties, not only that the test finishes.
Required invariants:
- no message loss
- no duplicate delivery unless the scenario explicitly allows at-least-once duplicates
- committed offsets move forward exactly as expected
- consumers and producers remain closable after the scenario ends
- no hidden resource leak remains after shutdown
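The first three invariants could share one helper that inspects the offsets a test actually received for a partition. The following is a sketch under stated assumptions: `checkPartitionOffsets` and `OffsetReport` are hypothetical names, offsets are modeled as plain numbers, and the topic is assumed to be non-compacted and non-transactional so that delivered offsets are expected to be contiguous.

```typescript
// Hypothetical invariant helper. Assumes contiguous offsets, which does not hold
// for compacted topics or topics containing transactional control records.

interface OffsetReport {
  duplicates: number[];       // offsets delivered more than once
  gaps: number[];             // offsets in the expected range that never arrived
  nextOffsetToCommit: number; // highest delivered offset + 1
}

function checkPartitionOffsets(delivered: number[], firstExpected: number): OffsetReport {
  const seen: Record<number, boolean> = {};
  const duplicates: number[] = [];
  let highest = firstExpected - 1;

  for (const offset of delivered) {
    if (seen[offset]) {
      duplicates.push(offset);
    } else {
      seen[offset] = true;
    }
    if (offset > highest) highest = offset;
  }

  // Every offset between the first expected and the highest delivered must be present.
  const gaps: number[] = [];
  for (let offset = firstExpected; offset <= highest; offset++) {
    if (!seen[offset]) gaps.push(offset);
  }

  return { duplicates, gaps, nextOffsetToCommit: highest + 1 };
}
```

A recovery test would then assert that `gaps` is empty, that `duplicates` is empty unless the scenario allows at-least-once delivery, and that the committed offset read back from the broker equals `nextOffsetToCommit`.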
The workflow needs a stable baseline source.
Recommended approach:
- on every successful run, store JSON results, logs, and resource samples in S3 or a compatible object store
- keep a lightweight `regression-baseline` object containing the latest accepted benchmark and resource metrics for each lane
- let later runs download those stored results and compare against the latest accepted baseline
- keep GitHub Actions artifacts only as short-lived convenience attachments for the current run
Stored objects should be organized by:
- workflow name
- commit SHA
- branch
- runner label
- Kafka version
- Node.js version
- timestamp
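As an illustration of the dimensions above, object keys could be derived deterministically from the run identity. The layout, the `RunIdentity` type, and the function names below are all hypothetical; the plan only fixes which dimensions the keys should encode, not their order or format.

```typescript
// Hypothetical key layout for an S3-compatible store; the ordering is an assumption.

interface RunIdentity {
  workflow: string;
  branch: string;
  runnerLabel: string;
  kafkaVersion: string;
  nodeVersion: string;
  timestamp: string; // for example an ISO 8601 UTC timestamp
  commitSha: string;
}

// Replaces characters that are awkward in object keys (such as ':' or '/').
function sanitizeSegment(segment: string): string {
  return segment.replace(/[^A-Za-z0-9._-]/g, '_');
}

// Key for the immutable result of one run.
function resultObjectKey(id: RunIdentity): string {
  return [
    id.workflow,
    id.branch,
    id.runnerLabel,
    `kafka-${id.kafkaVersion}`,
    `node-${id.nodeVersion}`,
    id.timestamp,
    `${id.commitSha}.json`,
  ]
    .map(sanitizeSegment)
    .join('/');
}

// Key for the mutable "latest accepted baseline" of the same lane.
function baselineObjectKey(id: RunIdentity): string {
  return [
    id.workflow,
    id.runnerLabel,
    `kafka-${id.kafkaVersion}`,
    `node-${id.nodeVersion}`,
    'regression-baseline.json',
  ]
    .map(sanitizeSegment)
    .join('/');
}
```

Keeping the baseline key free of commit SHA and timestamp means each lane has exactly one baseline object to download, while per-run results stay immutable and addressable for triage.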
This storage strategy supports:
- historical retrieval
- baseline comparison
- regression triage
- re-running comparisons without depending on GitHub artifact retention windows
Performance results in GitHub-hosted environments can be noisy. Reduce noise by:
- pinning runner image, Node.js version, and Kafka version
- using a fixed topic layout, message sizes, partition counts, and warmup duration
- running each benchmark scenario multiple times and comparing medians
- separating correctness failures from performance warnings
- avoiding version matrices in the main performance comparison job
- separating the main perf lane from compatibility and auth lanes
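Comparing medians across repeated runs, as suggested above, needs only a small helper. This sketch uses a hypothetical `median` function; the number of repetitions per scenario is left open.

```typescript
// Hypothetical helper for summarizing repeated benchmark runs.

// Returns the median of a non-empty list of measurements.
// The median is less sensitive than the mean to a single noisy run.
function median(values: number[]): number {
  if (values.length === 0) throw new Error('median of empty sample');
  const sorted = values.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}
```

For instance, five throughput measurements with one outlier, `median([98000, 101000, 99500, 62000, 100200])`, summarize to `99500`, so the outlier does not trigger a false regression on its own.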
This plan explicitly covers the main production-like failure classes already seen in repository history:
- rebalance offset races
- stale epoch handling
- metadata refresh resilience
- consumer backpressure regressions with buffering or stalls
- offset commit correctness
- duplicate/loss invariants during recovery
- idempotent and transactional producer reliability
- shutdown and cleanup stability
- auth-path reliability
- schema-registry load behavior
- cross-version compatibility
- tombstone and control-batch correctness
- reauthentication deadlocks
- heavy-load deserialization failures
- long-running batch-consumer stalls
- keep existing CI as the correctness gate
- keep this regression plan document as the implementation reference
- create `regression.yml` with:
  - correctness smoke
  - existing backpressure tests
  - existing memory test on merge to `main`
  - explicit invariant checks for duplicates, gaps, and committed offsets
- convert `benchmarks/*` into JSON-producing scripts
- add baseline download and comparison logic against the stored S3-compatible results
- publish GitHub Actions job summaries with trend deltas
- add reduced oldest/newest Kafka compatibility lanes
- add the missing issue-driven regression tests listed above
- expand this workflow with the remaining issue-driven scenarios
- add auth-heavy and schema-registry-heavy soak coverage to this workflow
- `.github/workflows/regression.yml`
- benchmark result schema such as `artifacts/regression/*.json`
- baseline compare script
- resource sampler script
- new regression tests for issues #223, #226, #227, #228, #232, #248, #260, #267, and open enhancement #128
- invariant helpers for duplicate, gap, and committed-offset assertions
- reduced compatibility suite for oldest and newest supported Kafka versions
- auth-focused regression suite
- schema-registry load regression suite
- shutdown and leaked-handle regression suite