feat(log-collector): add per-pod K8s events collector#3037

Open
ygrishajev wants to merge 1 commit into main from feat/log-collector-events-collector

@ygrishajev ygrishajev commented Apr 2, 2026

Why

Closes CON-1

The log-collector captures pod logs but not Kubernetes events. Events like scheduling failures, image pull errors, OOMKills, and back-off restarts are critical observability signals that are currently lost.

What

  • Add K8sEventsCollectorService that watches K8s events per pod via the Watch API
  • Add PodEventsCollectorFactory for per-pod event collector instances
  • Events are written as JSON lines to the same per-pod log file alongside container logs
  • Fluent Bit picks up events and forwards to Datadog (JSON auto-parsed)
  • Graceful 403 handling: if events watch is forbidden, logs continue
  • Non-403 errors propagate to bootstrap and crash the process, so K8s restarts the collector
  • Add watch verb to events RBAC role
  • Add e2e tests (vitest + K8s SDK) for local testing — not in CI, requires make dev and a local K8s cluster
  • Add architecture docs with diagram
  • Update README
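
The 403 rule in the list above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: `isForbidden`, `collectEventsOrSkip`, and the `statusCode`/`code` error fields are hypothetical names, and the real Kubernetes client may expose the HTTP status differently.

```typescript
// Hypothetical error shape: the real client may expose the HTTP status differently.
interface HttpStatusError {
  statusCode?: number;
  code?: number;
}

// True when the error carries an HTTP 403 in either assumed field.
function isForbidden(error: unknown): boolean {
  if (typeof error !== "object" || error === null) return false;
  const { statusCode, code } = error as HttpStatusError;
  return statusCode === 403 || code === 403;
}

type WatchOutcome = "completed" | "skipped-forbidden";

// Runs an events watch. A 403 only disables event collection for this pod;
// any other error is rethrown so the caller (bootstrap) can fail fast.
async function collectEventsOrSkip(startWatch: () => Promise<void>): Promise<WatchOutcome> {
  try {
    await startWatch();
    return "completed";
  } catch (error) {
    if (isForbidden(error)) return "skipped-forbidden";
    throw error;
  }
}
```

Log collection would run independently of this call, so a forbidden events watch leaves it untouched.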

coderabbitai bot commented Apr 2, 2026

Walkthrough

The changes extend the log collector to simultaneously collect Kubernetes events alongside container logs. A new K8sEventsCollectorService and accompanying factory are introduced, the existing K8sCollectorService is updated to run event and log collection concurrently per pod, RBAC permissions are expanded to allow watch operations on events, and comprehensive E2E and unit tests are added.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Documentation & Architecture: apps/log-collector/README.md, apps/log-collector/docs/ARCHITECTURE.md | Updated project description to reflect log and event collection; added new architecture document describing system components, pod discovery patterns, parallel log/event flows, error handling, and fallback behaviors. |
| Kubernetes RBAC Configuration: apps/log-collector/k8s/role.yaml | Extended events resource permissions from ["get", "list"] to include ["get", "list", "watch"] to enable event watching capability. |
| Test Infrastructure Setup: apps/log-collector/package.json, apps/log-collector/vitest.e2e.config.ts, apps/log-collector/test/seeders/kubernetes-event.seeder.ts | Added test:e2e npm script, @dotenvx/dotenvx dev dependency, E2E Vitest configuration with 120s test timeout, and Kubernetes event test data generator. |
| Event Collection Core Implementation: apps/log-collector/src/services/k8s-events-collector/k8s-events-collector.service.ts, apps/log-collector/src/factories/pod-events-collector/pod-events-collector.factory.ts | Implemented new K8sEventsCollectorService with watch-based event streaming, JSON-formatted output, resource version tracking for reconnection, and 403-forbidden fallback handling; added singleton factory to instantiate service with DI-resolved dependencies. |
| Service Integration & Error Handling: apps/log-collector/src/services/k8s-collector/k8s-collector.service.ts, apps/log-collector/src/bootstrap/bootstrap.ts, apps/log-collector/src/index.ts | Modified K8sCollectorService to create and run event collection in parallel with log collection per pod, removed nodeProcess.exit() termination in favor of error propagation, and updated documentation comments to reflect combined log and event collection. |
| Unit & Type Assertion Updates: apps/log-collector/src/services/file-destination/file-destination.service.spec.ts, apps/log-collector/src/services/k8s-collector/k8s-collector.service.spec.ts, apps/log-collector/src/services/k8s-events-collector/k8s-events-collector.service.spec.ts | Added explicit TypeScript type assertions in mock implementations; updated K8sCollectorService tests to verify concurrent event and log collection per pod and assert POD_COLLECTION_FAILED event on error; added comprehensive K8sEventsCollectorService spec covering watch lifecycle, JSON formatting, forbidden error handling, reconnection with resource version, and abort signal termination. |
| End-to-End Test Suite: apps/log-collector/test/e2e/events-collector.e2e.ts | Added E2E test suite validating scheduled/started event capture, event collection continuation after pod restarts, concurrent log/event forwarding, optional Datadog delivery verification, and event collection resilience when RBAC watch permission is restricted. |

Sequence Diagram(s)

sequenceDiagram
    participant PC as Pod Discovery
    participant KCS as K8sCollectorService
    participant PECF as PodEventsCollectorFactory
    participant KECS as K8sEventsCollectorService
    participant PLCF as PodLogsCollectorFactory
    participant PLCS as PodLogsCollectorService
    participant Watch as Kubernetes Watch API
    participant FD as FileDestinationService

    PC->>KCS: Pod detected
    KCS->>PECF: create(podInfo, fileDestination, signal)
    PECF->>KECS: instantiate with DI deps
    KCS->>PLCF: create(podInfo, fileDestination, signal)
    PLCF->>PLCS: instantiate with DI deps
    
    par Event Collection
        KECS->>Watch: watch(events, fieldSelector=podName)
        Watch-->>KECS: event stream
        KECS->>KECS: format JSON line (timestamp, reason, phase)
        KECS->>FD: write(jsonEvent)
    and Log Collection
        PLCS->>FD: stream logs from pod
        FD-->>FD: rotate & write logs
    end
    
    PC->>KCS: Pod deleted
    KCS->>KECS: abort signal triggered
    KCS->>PLCS: abort signal triggered
    KECS-->>FD: close stream
    PLCS-->>FD: close stream
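
The "format JSON line" step in the diagram might look something like the sketch below. The output field set is an assumption based only on the diagram's "timestamp, reason, phase" note: `formatEventLine`, the `source` marker, and treating `phase` as the watch notification type are all hypothetical, and the PR's actual format may differ.

```typescript
// Minimal subset of a Kubernetes Event in its JSON wire form.
interface PodEvent {
  type?: string; // "Normal" or "Warning"
  reason?: string; // e.g. "Scheduled", "BackOff"
  message?: string;
  lastTimestamp?: string;
}

// Serializes one event as a single JSON line.
// `phase` is assumed here to be the watch notification type (e.g. "ADDED").
function formatEventLine(
  phase: string,
  event: PodEvent,
  now: () => Date = () => new Date()
): string {
  const record = {
    timestamp: event.lastTimestamp ?? now().toISOString(),
    source: "k8s-event", // hypothetical marker to tell events apart from container logs
    phase,
    type: event.type ?? "Normal",
    reason: event.reason ?? "",
    message: event.message ?? ""
  };
  // JSON.stringify escapes embedded newlines, so each event stays on one line.
  return JSON.stringify(record) + "\n";
}
```

Because every record is emitted as one complete line, a JSON-aware forwarder can parse each line independently.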

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 Hops of joy, the collector now sees,
Both logs and events dance on the breeze,
Twin streams of data, in parallel flow,
Events and logs steal the show! 🎉

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The PR title accurately and specifically describes the main feature addition: implementing a per-pod Kubernetes events collector, which is the primary change across all modified files. |
| Description check | ✅ Passed | The PR description clearly explains the motivation, specific changes, and behavior of the new feature, matching the template's required sections. |


describe("Logs and Events Collector E2E", () => {
beforeAll(() => {
execSync(`kubectl apply -f ${K8S_DIR}`, { stdio: "ignore" });

Check warning

Code scanning / CodeQL

Shell command built from environment values Medium test

This shell command depends on an uncontrolled absolute path.
{ timeout: 15_000, interval: 2000 }
);
} finally {
execSync(`kubectl apply -f ${K8S_DIR}/role.yaml`, { stdio: "ignore" });

Check warning

Code scanning / CodeQL

Shell command built from environment values Medium test

This shell command depends on an uncontrolled absolute path.
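
One common way to address this class of warning is to avoid the shell entirely and pass the path as a separate argument. The sketch below assumes that approach; `kubectlArgs` and `kubectl` are hypothetical helpers, not code from this PR.

```typescript
import { execFileSync } from "node:child_process";

// Builds the argument vector; the path stays a single argument
// no matter which characters it contains.
function kubectlArgs(action: "apply" | "delete", manifestPath: string): string[] {
  return [action, "-f", manifestPath];
}

// execFileSync does not spawn a shell by default, so the path is never
// interpreted as shell syntax.
function kubectl(action: "apply" | "delete", manifestPath: string): void {
  execFileSync("kubectl", kubectlArgs(action, manifestPath), { stdio: "ignore" });
}
```

With this shape, `kubectl("apply", K8S_DIR)` would replace the template-string form shown in the flagged lines.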

codecov bot commented Apr 2, 2026

Codecov Report

❌ Patch coverage is 88.13559% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.40%. Comparing base (eed53d2) to head (8664f4b).
⚠️ Report is 2 commits behind head on main.
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...d-events-collector/pod-events-collector.factory.ts | 0.00% | 5 Missing ⚠️ |
| ...s-events-collector/k8s-events-collector.service.ts | 95.83% | 2 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3037      +/-   ##
==========================================
- Coverage   59.61%   59.40%   -0.22%     
==========================================
  Files        1034     1010      -24     
  Lines       24242    23900     -342     
  Branches     6012     5969      -43     
==========================================
- Hits        14453    14198     -255     
+ Misses       8539     8456      -83     
+ Partials     1250     1246       -4     
| Flag | Coverage Δ | *Carryforward flag |
| --- | --- | --- |
| api | 81.24% <ø> (ø) | Carriedforward from eed53d2 |
| deploy-web | 43.22% <ø> (ø) | Carriedforward from eed53d2 |
| log-collector | 85.12% <88.13%> (+0.20%) ⬆️ | |
| notifications | 86.06% <ø> (ø) | Carriedforward from eed53d2 |
| provider-console | 81.48% <ø> (ø) | Carriedforward from eed53d2 |
| provider-proxy | 85.21% <ø> (ø) | Carriedforward from eed53d2 |
| tx-signer | ? | |

*This pull request uses carry forward flags. Click here to find out more.

| Files with missing lines | Coverage Δ |
| --- | --- |
| apps/log-collector/src/bootstrap/bootstrap.ts | 81.81% <ø> (ø) |
| ...rc/services/k8s-collector/k8s-collector.service.ts | 100.00% <100.00%> (ø) |
| ...s-events-collector/k8s-events-collector.service.ts | 95.83% <95.83%> (ø) |
| ...d-events-collector/pod-events-collector.factory.ts | 0.00% <0.00%> (ø) |

... and 26 files with indirect coverage changes

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
apps/log-collector/src/services/k8s-collector/k8s-collector.service.ts (1)

40-56: ⚠️ Potential issue | 🟠 Major

Concurrent writes to shared stream will cause interleaved JSON output.

Both collectPodLogs() and collectPodEvents() call createWriteStream() on the same fileDestination instance, which returns the same memoized PassThrough stream. Both collectors then write to it concurrently via Promise.all. Node.js does not coordinate multiple writers on one stream: chunks are queued in call order, so a log chunk that ends mid-line can be followed by an event line, corrupting the JSON lines in the output file.

Consider one of these approaches:

  1. Use separate file destinations per collector (e.g., _logs.log and _events.log)
  2. Serialize writes through a mutex or queue
  3. Buffer complete lines before writing, ensuring atomic operations
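
Option 3 could be sketched as below: each collector gets its own buffer and forwards only complete lines to the shared stream. `LineSink` and `LineBufferedWriter` are hypothetical names for illustration; whether this fits the existing FileDestinationService is not something this page confirms.

```typescript
// Anything with a write(string) method, e.g. a Node.js Writable.
interface LineSink {
  write(chunk: string): unknown;
}

// Per-collector wrapper that forwards only whole lines to the shared sink.
class LineBufferedWriter {
  private pending = "";
  private readonly sink: LineSink;

  constructor(sink: LineSink) {
    this.sink = sink;
  }

  // Buffers the chunk and flushes everything up to the last newline in one call.
  write(chunk: string): void {
    this.pending += chunk;
    const lastNewline = this.pending.lastIndexOf("\n");
    if (lastNewline === -1) return;
    this.sink.write(this.pending.slice(0, lastNewline + 1));
    this.pending = this.pending.slice(lastNewline + 1);
  }

  // Flushes a trailing partial line, terminating it with a newline.
  end(): void {
    if (this.pending.length > 0) {
      this.sink.write(this.pending + "\n");
      this.pending = "";
    }
  }
}
```

Each sink write then carries one or more whole lines, so chunks from the two collectors can alternate without splitting a JSON record. The trade-off is that a partial line is held in memory until its newline arrives.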
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/log-collector/src/services/k8s-collector/k8s-collector.service.ts`
around lines 40 - 56, startCollectionForPod currently creates a single
fileDestination and passes it to both collectPodLogs() and collectPodEvents(),
causing both to call createWriteStream() on the same memoized PassThrough and
produce interleaved JSON; fix by giving each collector its own independent
destination (e.g., create a separate fileDestination/logs and
fileDestination/events via this.fileDestinationFactory.create or extend the
factory to produce distinct child destinations) and pass those to
podLogsCollector.collectPodLogs() and podEventsCollector.collectPodEvents()
instead of sharing one, or alternatively implement a serialization layer
(mutex/queue) inside the fileDestination.createWriteStream implementation so
writes from collectPodLogs and collectPodEvents are serialized and cannot
interleave.
🧹 Nitpick comments (1)
apps/log-collector/test/e2e/events-collector.e2e.ts (1)

189-205: Consider adding error handling for Datadog API responses.

The queryDatadog function assumes a successful JSON response structure. A non-2xx response or malformed JSON could cause confusing test failures.

🔧 Optional: Add response status check
       const response = await fetch(`https://api.${DD_SITE}/api/v2/logs/events/search`, {
         method: "POST",
         headers: { "Content-Type": "application/json", "DD-API-KEY": apiKey, "DD-APPLICATION-KEY": appKey },
         body: JSON.stringify({
           filter: { query, from: "now-5m", to: "now" },
           page: { limit: 10 },
           sort: "-timestamp"
         })
       });

+      if (!response.ok) {
+        throw new Error(`Datadog API error: ${response.status} ${response.statusText}`);
+      }
+
       return (await response.json()) as { data: Array<{ attributes: Record<string, unknown> }> };
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/log-collector/test/e2e/events-collector.e2e.ts` around lines 189 - 205,
The queryDatadog function assumes a successful JSON response; add robust
response/error handling in queryDatadog: after the fetch, check response.ok and
if false read response.text() and throw a descriptive Error including
response.status, response.statusText and the response body; wrap the JSON parse
in try/catch and on parse failure throw an Error including the raw response text
and the original parse error; keep the returned shape the same when successful.
This change should be applied inside the queryDatadog function to ensure non-2xx
responses and malformed JSON produce clear, actionable errors.
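
The handling described in this prompt might be sketched as follows. `HttpResponseLike` and `readJsonOrThrow` are hypothetical names; the interface is a minimal stand-in for the subset of the fetch `Response` the check needs, so the sketch stays self-contained.

```typescript
// Minimal stand-in for the parts of a fetch Response used here.
interface HttpResponseLike {
  ok: boolean;
  status: number;
  statusText: string;
  text(): Promise<string>;
}

// Reads the body once, then fails loudly on non-2xx status or malformed JSON.
async function readJsonOrThrow<T>(response: HttpResponseLike): Promise<T> {
  const body = await response.text();
  if (!response.ok) {
    throw new Error(`Datadog API error: ${response.status} ${response.statusText}: ${body}`);
  }
  try {
    return JSON.parse(body) as T;
  } catch (error) {
    throw new Error(`Datadog API returned malformed JSON (${String(error)}): ${body}`);
  }
}
```

Inside queryDatadog, the final `return (await response.json())` line would become a call to this helper with the same result type.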
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@apps/log-collector/src/services/k8s-events-collector/k8s-events-collector.service.ts`:
- Around line 46-87: The AsyncChannel's internal queue is unbounded and can grow
if Kubernetes emits events faster than watchEvents consumes them; update the
implementation to provide backpressure (e.g., add a maxSize parameter to
AsyncChannel and either block producers when full or apply an overflow policy
like dropping oldest/newest) and use that in watchEvents (the channel created in
watchEvents) so the watcher callback either awaits when the channel is full or
drops events according to the chosen policy; ensure the watcher callback (passed
into this.watch.watch) handles the async backpressure (or checks channel.push
result) and that the for-await consumer in watchEvents continues to drain the
bounded channel to avoid unbounded memory growth.
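
A bounded channel with a drop-oldest overflow policy could look roughly like this. It is a single-consumer sketch under stated assumptions: `BoundedChannel`, `maxSize`, and `droppedCount` are hypothetical names, and the PR's AsyncChannel API is not shown on this page.

```typescript
// Single-consumer channel that keeps at most maxSize queued items.
class BoundedChannel<T> {
  private readonly queue: T[] = [];
  private readonly maxSize: number;
  private waiter: ((item: T) => void) | null = null;
  droppedCount = 0;

  constructor(maxSize: number) {
    if (maxSize < 1) throw new RangeError("maxSize must be at least 1");
    this.maxSize = maxSize;
  }

  // Returns false when the oldest queued item was dropped to make room.
  push(item: T): boolean {
    if (this.waiter) {
      // A consumer is already waiting: hand the item over directly.
      const resolve = this.waiter;
      this.waiter = null;
      resolve(item);
      return true;
    }
    let keptAll = true;
    if (this.queue.length >= this.maxSize) {
      this.queue.shift();
      this.droppedCount += 1;
      keptAll = false;
    }
    this.queue.push(item);
    return keptAll;
  }

  // Resolves with the next item, waiting if the queue is empty.
  next(): Promise<T> {
    if (this.queue.length > 0) {
      return Promise.resolve(this.queue.shift() as T);
    }
    return new Promise<T>((resolve) => {
      this.waiter = resolve;
    });
  }
}
```

Dropping the oldest item keeps memory bounded at the cost of losing events under sustained overload; `droppedCount` lets the caller log or alert on that. Blocking the producer instead would need the watch callback to await the push.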

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 654cfa46-860f-40c8-916b-92c303105310

📥 Commits

Reviewing files that changed from the base of the PR and between 59a50eb and 8664f4b.

⛔ Files ignored due to path filters (1)
  • apps/log-collector/docs/architecture.png is excluded by !**/*.png
📒 Files selected for processing (15)
  • apps/log-collector/README.md
  • apps/log-collector/docs/ARCHITECTURE.md
  • apps/log-collector/k8s/role.yaml
  • apps/log-collector/package.json
  • apps/log-collector/src/bootstrap/bootstrap.ts
  • apps/log-collector/src/factories/pod-events-collector/pod-events-collector.factory.ts
  • apps/log-collector/src/index.ts
  • apps/log-collector/src/services/file-destination/file-destination.service.spec.ts
  • apps/log-collector/src/services/k8s-collector/k8s-collector.service.spec.ts
  • apps/log-collector/src/services/k8s-collector/k8s-collector.service.ts
  • apps/log-collector/src/services/k8s-events-collector/k8s-events-collector.service.spec.ts
  • apps/log-collector/src/services/k8s-events-collector/k8s-events-collector.service.ts
  • apps/log-collector/test/e2e/events-collector.e2e.ts
  • apps/log-collector/test/seeders/kubernetes-event.seeder.ts
  • apps/log-collector/vitest.e2e.config.ts

coderabbitai bot commented Apr 2, 2026

Caution

Review failed

An error occurred during the review process. Please try again later.


👌

👌
