Commit 704897c

backend operator redesign doc

1 parent 82004ac commit 704897c
File tree: 2 files changed, +335 -0 lines changed
Lines changed: 295 additions & 0 deletions
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

SPDX-License-Identifier: Apache-2.0
-->

# Backend Operators Redesign

**Author**: [Xutong Ren](https://github.com/xutongNV), [Ethan Yu](https://github.com/ethany-nv)<br>
**PIC**: [Xutong Ren](https://github.com/xutongNV)<br>
**Proposal Issue**: [#147](https://github.com/nvidia/osmo/issues/147)

## Overview

This project aims to redesign the OSMO backend operator component to address critical scaling, reliability, and performance issues. The redesign will introduce horizontal scaling, improve multi-threading support, reduce memory footprint, and establish proper monitoring and testing infrastructure.

### Motivation

To prepare for future growth, OSMO's backend operators require architectural enhancements to support production workloads on large-scale Kubernetes backends. The current system provides a solid foundation but can be optimized in several key areas: system stability under sustained load, event delivery consistency, and workflow status update latency.

This redesign will proactively strengthen OSMO's ability to support production-scale deployments reliably and efficiently as customer workloads continue to grow. Rewriting the backend listener in Go lets us leverage the native Kubernetes Go client for more performant operations and more built-in features (such as node/pod event caching).

### Problem

**Scaling**
- The Agent Service cannot scale horizontally to process backend messages

**Reliability**
- The Backend Listener frequently restarts, impacting system availability

**Performance**
- Memory leaks in the listener component consume significant resources
- Workflow status update latency blocks users from exec and port forwarding
- The single-threaded listener design limits message throughput

**Observability**
- Lack of proper KPIs and logging infrastructure
- Insufficient test coverage makes it difficult to diagnose issues and measure system health

## Use Cases

| Use Case | Description |
|---|---|
| Large compute backend | The backend operator needs to handle a large compute backend, e.g. a Kubernetes cluster with 2k nodes |
| User exec into their task | A user can exec into their task as soon as the task becomes running |

## Requirements

| Title | Description | Type |
|---|---|---|
| Scalability | Agent Service shall handle multiple backend messages in parallel and be able to scale horizontally | Scalability |
| Event Caching | Backend operator shall cache Kubernetes events to reduce API server load and improve response times | Performance |
| Message Compacting | Agent Service shall compact messages and deduplicate stale messages to improve processing times | Performance |
| Data Accuracy | Backend operator shall provide accurate and up-to-date workflow and resource status information | Reliability |
| Low Latency | Backend operator shall update workflow status with minimal latency to enable timely user actions (exec, port forwarding) | Performance |
| Recoverability | Backend operator shall be able to recover from failures without data loss | Reliability |
| Traceability | Backend operator shall provide structured logging for debugging and auditing purposes | Observability |
| KPI Dashboard | Backend operator shall expose KPIs and metrics for monitoring system health and outage alerts | Observability |
| No Message Drop | Backend operator shall ensure no messages are dropped | Reliability |
| Resource Efficiency | Backend operator shall utilize CPU efficiently and prevent memory leaks | Performance |
| System Robustness | Backend operator shall operate without frequent restarts and maintain stability under load | Reliability |

## Architectural Details

### Current Architecture

```mermaid
flowchart LR
    subgraph CB["Compute Backend"]
        BL["Backend Listener"]
    end

    Agent["Agent Service"]

    BL <-->|WebSocket| Agent
```

**Component Description:**
- **Backend Listener**: Listens to pod, node, and backend events across ALL namespaces
- **Agent Service**: Receives backend messages and processes them in order

### Proposed Architecture

```mermaid
flowchart LR
    Worker["Worker"]
    Redis["Message Queue<br/>(Redis)"]
    Agent["Operator Service"]

    subgraph CB["Compute Backend"]
        WL["Workflow Listener"]
        RL["Resource Listener"]
        EL["Event Listener"]
    end

    Agent --> Redis
    Redis --> Worker
    WL <-->|gRPC| Agent
    RL <-->|gRPC| Agent
    EL <-->|gRPC| Agent
```

**Component Description:**

- **Workflow Listener**: Listens to pod status changes in the OSMO namespace
- **Resource Listener**: Listens to pod and node events and aggregates resource usage
- **Event Listener**: Listens to scheduler events
- **Operator Service (renamed from Agent Service)**: Receives backend messages, puts them in the message queue, and acknowledges them
- **Message Queue (Redis)**: Message broker for compacting and asynchronous processing
- **Worker**: Task executor that processes messages from the queue

## Detailed Design

### Decouple Backend Listener (Golang)

(TODO) Decide whether to use multiple executables or just coroutines.

The current Backend Listener does the following as multiple threads/coroutines in a single Python script:

- **4 threads** watching Kubernetes resources:
  - `watch_pod_events` - Monitors pod status changes in ALL namespaces
  - `watch_node_events` - Monitors node events across the cluster
  - `watch_backend_events` - Monitors cluster-level scheduler events
  - `get_service_control_updates` - Receives control messages from the Agent Service (e.g., node condition rules)
- **6 coroutines** managing WebSocket communication:
  - `heartbeat` - Sends periodic heartbeat messages
  - 5 `websocket_connect` coroutines maintaining separate WebSocket connections for the CONTROL, POD, NODE, EVENT, and HEARTBEAT message types

Due to the complexity of the Backend Listener's functionality and the limitations of Python (the GIL prevents true parallelism), we observe low CPU utilization, message processing delays, and difficulty scaling to large Kubernetes clusters.

The proposed architecture divides the Backend Listener into three concerns: updating OSMO pod status (most important), aggregating resource usage, and streaming scheduling events.

Decoupling the Backend Listener into separate Go-based services leverages goroutines to achieve true parallelism and higher CPU utilization, simplifies management by giving each listener a single well-defined responsibility, and, most importantly, prioritizes workflow status updates so that they are no longer blocked by high-volume resource/node messages.

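To make the single-binary option from the TODO above concrete, a minimal sketch of the three listeners running as goroutines under one errgroup (the `run*Listener` functions are illustrative stubs, not existing code):

```go
package main

import (
	"context"
	"log"

	"golang.org/x/sync/errgroup"
)

func main() {
	g, ctx := errgroup.WithContext(context.Background())

	// Each listener owns a single, well-defined responsibility.
	g.Go(func() error { return runWorkflowListener(ctx) }) // pod status in the OSMO namespace
	g.Go(func() error { return runResourceListener(ctx) }) // pod/node resource aggregation
	g.Go(func() error { return runEventListener(ctx) })    // scheduler events

	// If any listener fails, the whole process exits and is restarted by Kubernetes.
	if err := g.Wait(); err != nil {
		log.Fatalf("listener exited: %v", err)
	}
}

// Stubs standing in for the watch loops described in this design.
func runWorkflowListener(ctx context.Context) error { <-ctx.Done(); return ctx.Err() }
func runResourceListener(ctx context.Context) error { <-ctx.Done(); return ctx.Err() }
func runEventListener(ctx context.Context) error    { <-ctx.Done(); return ctx.Err() }
```
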
### Deprecate Backend Heartbeat

Currently the Backend Listener sends heartbeats to the Agent Service on a schedule. The Agent Service updates the database with `last_heartbeat` to determine whether a backend is marked as `online`. Based on the following considerations, the new design deprecates the heartbeat message and the backend online status:

- Heartbeats are not reliably sent, and their frequency requires tuning for different compute backend sizes due to the current implementation: load in other threads makes the effective frequency much lower than expected
- The current backend online status is not accurate since it does not take the Backend Worker into consideration
- Workflows are not blocked if a backend is offline

Therefore, this design removes the need for backend heartbeat messages and backend online status, and establishes the following:

- Each backend container has its own liveness probe and alerts if the pod errors or terminates (**already achieved**)
- The pool online status changes to a flag indicating whether the pool is in maintenance; the OSMO service blocks workflow submissions based on it

### Cache Messages on Backend Side

Caching on the backend side reduces the number of messages the Backend Listener has to send. The current implementation of caching is based on dropping duplicate events that have been seen within a time window. However, this sometimes causes memory leaks in the Backend Listener.

Go has native support for caching in its Kubernetes client library (client-go informers): duplicate events are dropped and stale events are compacted.

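A minimal sketch of how a listener could lean on this, using a client-go shared informer scoped to a single namespace (the `osmo` namespace name and the `onPodUpdate` helper are illustrative assumptions):

```go
package main

import (
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// The shared informer keeps a local cache of pods and delivers change
	// notifications, so the listener does not poll the API server directly.
	factory := informers.NewSharedInformerFactoryWithOptions(
		client, 30*time.Second, // periodic resync as a safety net
		informers.WithNamespace("osmo"), // illustrative namespace name
	)

	podInformer := factory.Core().V1().Pods().Informer()
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			onPodUpdate(newObj.(*corev1.Pod))
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop) // informers run in their own goroutines
	factory.WaitForCacheSync(stop)
	<-stop
}

// onPodUpdate is a placeholder for forwarding the status change to the Operator Service.
func onPodUpdate(pod *corev1.Pod) {
	_ = pod
}
```
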
### Decouple Agent Service

The current Agent Service handles multiple responsibilities as asynchronous functions in a single Python script:

- Connects to the Backend Worker and executes Backend Jobs
- Connects to the Backend Listener for the CONTROL, POD, NODE, EVENT, and HEARTBEAT message types

In the proposed design, the Backend Worker endpoint will be split from the Backend Listener endpoints. This separation allows the Operator (Agent) Service to scale independently based on workload type, improves CPU and memory management, and enables parallel processing of listener messages and job execution tasks.

### gRPC Connections between Backend Listeners and Operator Services

gRPC is used instead of WebSocket to better handle back-pressure, which occurs frequently since the current Agent Service processes messages sequentially. gRPC also provides more reliable connections and lower latency than WebSocket.

This design retains the current ACK mechanism, which was originally designed to handle back-pressure. The ACK mechanism also prevents message loss when the Operator Service fails, without requiring a full resync of all backend events.

The real bottleneck for status update latency is not the networking protocol but the speed at which the Operator Service processes messages. The key to improving latency is making the Operator Service handle messages asynchronously.

```mermaid
---
title: Current Flow
---
flowchart LR
    Backend["Backend"]

    subgraph AgentService["Agent Service"]
        Receiver["Receiver Coroutine"]
        Queue["Message Queue<br/>(In Memory)"]
        Worker["Worker Coroutine"]

        Receiver -->|2| Queue
        Queue -->|3| Worker
    end

    Backend -->|1| Receiver
    Worker -->|4 ACK| Backend
```

```mermaid
---
title: Proposed Flow
---
flowchart LR
    Backend["Backend"]
    Agent["Operator Service"]
    Redis["Message Queue<br/>(Redis)"]
    Worker1["Worker"]
    Worker2["Worker"]

    Backend -->|1| Agent
    Agent -->|3 ACK| Backend
    Agent -->|2| Redis
    Redis -->|4| Worker1
    Redis -->|4| Worker2
```

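A minimal sketch of the Operator Service side of the proposed flow, assuming a hypothetical generated gRPC package `pb` with a bidirectional `StreamMessages` RPC and `BackendMessage`/`Ack` types (these names are placeholders for the proto to be defined in Phase 1, and imports/codegen are omitted):

```go
// Sketch only: `pb` is a hypothetical generated package and `queue` is an
// abstract enqueue dependency, injected so the handler stays fast and thin.
type operatorServer struct {
	pb.UnimplementedOperatorServer
	queue interface {
		Enqueue(ctx context.Context, msg *pb.BackendMessage) error
	}
}

// StreamMessages enqueues each listener message and ACKs it immediately,
// so slow downstream processing no longer back-pressures the connection.
func (s *operatorServer) StreamMessages(stream pb.Operator_StreamMessagesServer) error {
	for {
		msg, err := stream.Recv()
		if err != nil {
			return err // io.EOF when the listener closes its side of the stream
		}
		if err := s.queue.Enqueue(stream.Context(), msg); err != nil {
			return err // not ACKed, so the listener can resend after reconnecting
		}
		if err := stream.Send(&pb.Ack{MessageId: msg.GetId()}); err != nil {
			return err
		}
	}
}
```
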
### Operator Service with Asynchronous Processing

The real bottleneck behind status update delays and the lack of scalability is the sequential processing in the Agent Service.
Due to implementation limitations, the Agent Service processes messages sequentially in the order they are received. The sequential order is also a correctness requirement, since the messages carry no time information other than their order.
The Agent Service holds a certain number of unprocessed messages in an in-memory queue and only ACKs a message once it has been processed. This also causes messages to pile up on the backend side and leads to memory growth over time.

To make processing scalable and parallel, this design introduces a message queue. The Operator Service now receives a message, puts it into the queue, and ACKs it. This completes in a fairly small amount of time and alleviates back-pressure on the connection, reducing message congestion on the backend side. Backend Listeners no longer need to hold unACKed messages until they are processed. On the other side, multiple workers can process messages in parallel, and the number of workers can be scaled up as needed.

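A sketch of the queue plumbing under the assumption that the queue is a plain Redis list (queue name, payload format, and the Redis address are illustrative): the Operator Service only performs the `Enqueue` call before ACKing, and any number of worker replicas can run the consume loop in parallel.

```go
package queue

import (
	"context"
	"log"
	"time"

	"github.com/redis/go-redis/v9"
)

const workflowQueue = "q:workflow-status" // illustrative queue name

// Enqueue is what the Operator Service calls before ACKing a message:
// a single LPUSH, so the gRPC handler returns quickly.
func Enqueue(ctx context.Context, rdb *redis.Client, payload []byte) error {
	return rdb.LPush(ctx, workflowQueue, payload).Err()
}

// RunWorker is the consume loop; each worker replica runs one of these,
// so adding replicas scales message processing horizontally.
func RunWorker(ctx context.Context, rdb *redis.Client, process func(context.Context, string) error) {
	for ctx.Err() == nil {
		// BRPOP blocks until a message arrives or the timeout expires.
		res, err := rdb.BRPop(ctx, 5*time.Second, workflowQueue).Result()
		if err == redis.Nil {
			continue // nothing to do yet
		}
		if err != nil {
			log.Printf("queue read failed: %v", err)
			time.Sleep(time.Second)
			continue
		}
		// res[0] is the queue name, res[1] is the payload pushed by Enqueue.
		if err := process(ctx, res[1]); err != nil {
			log.Printf("processing failed: %v", err)
		}
	}
}
```
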
### Message Queue

To ensure correctness with asynchronous workers, message timestamps are required from the backend. With timestamps, workers can determine the most recent state and discard stale updates when processing messages in parallel.

The message queue can be equipped with compacting and deduplication mechanisms to reduce the number of unprocessed messages. For multiple updates to the same resource, only the most recent message needs to be processed, while earlier messages can be safely discarded as they represent intermediate states.

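A sketch of the compaction step a worker could apply to a drained batch, assuming every message carries the backend-side timestamp described above (the `Message` shape is illustrative):

```go
package queue

import "time"

// Message is an illustrative envelope; the real proto would carry the
// backend-side observation time so stale updates can be discarded.
type Message struct {
	ResourceKey string    // e.g. the pod's namespace/name
	ObservedAt  time.Time // stamped by the listener when the event was seen
	Payload     []byte
}

// Compact keeps only the newest message per resource key; earlier messages
// are intermediate states and can be dropped safely.
func Compact(batch []Message) []Message {
	latest := make(map[string]Message, len(batch))
	for _, m := range batch {
		if cur, ok := latest[m.ResourceKey]; !ok || m.ObservedAt.After(cur.ObservedAt) {
			latest[m.ResourceKey] = m
		}
	}
	out := make([]Message, 0, len(latest))
	for _, m := range latest {
		out = append(out, m)
	}
	return out
}
```
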
Different message types are associated with different queues, and workers can prioritize one queue over others (e.g., workflow status updates).

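If each message type maps to its own Redis list, strict prioritization falls out of BLPOP checking the given keys in order; a sketch with illustrative queue names:

```go
package queue

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

// popNext always drains workflow-status updates before the lower-priority
// resource and scheduler-event queues: BLPOP checks the given keys in order
// and pops from the first non-empty list.
func popNext(ctx context.Context, rdb *redis.Client) (queueName, payload string, err error) {
	res, err := rdb.BLPop(ctx, 5*time.Second,
		"q:workflow-status", "q:resource", "q:event").Result()
	if err != nil {
		return "", "", err // redis.Nil means all three queues were empty within the timeout
	}
	return res[0], res[1], nil
}
```
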
### Backwards Compatibility

No. The Service and the Backend Operator must be upgraded to the same version.

### Documentation

- [ ] Update OSMO Deployment Guide

### Testing

TODO
_What unit, integration, or end-to-end tests need to be created or updated? How will these tests be integrated into automation? What test metrics will be tracked and what are the KPIs?_

## Implementation Plan

### Phase 1: Workflow Listener

**Milestone**: Split the Workflow Listener out of the Backend Listener

**Tasks**:
- [ ] Create the Workflow Listener in Go
- [ ] Create an Operator Service endpoint separate from the current Agent Service
- [ ] Set up networking and the message type proto between them
- [ ] Set up a basic Redis message queue
- [ ] Set up a worker

**Deliverables**:
- [ ] A functional Workflow Listener that can work alongside the current architecture

**Tests & Benchmarking**:
- [ ] Unit tests for networking and the message queue
- [ ] Benchmarks for update latency, etc.
- [ ] Load test for a large backend cluster

---

### Phase 2: Resource Listener and Event Listener

**Milestone**: Split out the remaining listeners

**Tasks**:
- [ ] Create the Resource Listener
- [ ] Create the Event Listener

**Deliverables**:
- [ ] Resource Listener
- [ ] Event Listener

**Tests & Benchmarking**:
- [ ] Unit tests for networking and the message queue
- [ ] Benchmarks for update latency, etc.
- [ ] Load test for a large backend cluster
- [ ] End-to-end test on real backend clusters

## Open Questions

- [ ] Is the Agent service needed? Can Backend Listeners connect to the Message Queue directly?
- [ ] Sending logs to the Agent Service and printing them out is a tricky implementation. Can we use a service sidecar for Backend Operator logs?
- [ ] Make use of the Kubernetes metrics server API to compute resource usage

Lines changed: 40 additions & 0 deletions

<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

SPDX-License-Identifier: Apache-2.0
-->

# Operator Redesign Benchmarks

## Python + WebSocket vs Go + gRPC

### Overview

This section presents performance benchmarks comparing the Python WebSocket implementation against the Go gRPC implementation across different message volumes.

**Test Configuration:**
- The client sends messages of type `UpdatePod`; the service receives each message and ACKs immediately
- Test scenarios: 50, 100, and 500 messages

### Benchmark Results

| #Messages | Language (Protocol) | Total Time | Avg Latency | Min Latency | Max Latency | Throughput (msg/sec) | Speed Improvement | Latency Improvement |
|-----------|---------------------|------------|-------------|-------------|-------------|----------------------|-------------------|---------------------|
| **50** | Python (WebSocket) | 172.44 ms | 1.29 ms | 811.66 µs | 1.92 ms | 289.96 | - | - |
| | Go (gRPC) | 72.82 ms | 308.60 µs | 203.72 µs | 1.08 ms | 686.67 | **2.37x** | **4.19x** |
| **100** | Python (WebSocket) | 267.45 ms | 1.48 ms | 986.87 µs | 2.42 ms | 373.90 | - | - |
| | Go (gRPC) | 56.03 ms | 292.65 µs | 201.75 µs | 1.17 ms | 1784.70 | **4.77x** | **5.06x** |
| **500** | Python (WebSocket) | 886.92 ms | 1.55 ms | 921.82 µs | 2.97 ms | 563.75 | - | - |
| | Go (gRPC) | 195.52 ms | 297.40 µs | 160.89 µs | 1.34 ms | 2557.24 | **4.54x** | **5.20x** |
