feat(BA-992): Offload health check capability to AppProxy #5147

kyujin-cho · 2025-07-15T09:38:10Z

Summary

Refactor model service health check architecture to offload health checking to AppProxy
Update API specifications for endpoint route management and health check configuration
Implement separate model entity handling for health check configuration
Add support for reading health check information from model-definition.yaml

Breaking Changes

Important: This PR breaks the health check capability on the OSS AppProxy. Future work will restore support. Until then, disable the health check feature in model-definition.yaml to use model services on open-source Backend.AI.

Changes

API Specifications

Updated model serving event types with new ModelServiceStatusEventArgs base class
Added EndpointRouteListUpdatedEvent for endpoint route synchronization
Refactored anycast/broadcast event structures for better separation of concerns

Health Check System

Added ModelServiceHelper class replacing ModelServicePredicateChecker
Implemented health check configuration reading from model-definition.yaml
Added health check configuration passing to AppProxy during endpoint creation
Separated model definition validation into discrete functions

Route Management

Updated endpoint route generation to use Redis for AppProxy communication
Modified session lifecycle handling for route status updates
Added generate_redis_route_info method for serializable connection data
Implemented notify_endpoint_route_update_to_appproxy for real-time updates

Database Changes

Enhanced endpoint model with Redis route info generation for AppProxy integration
Updated routing status management in session callbacks
Improved error handling and retry logic for route operations

Technical Details

The health check system now reads configuration from either:

Runtime variant profiles for predefined endpoints
model-definition.yaml for custom runtime variants
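
The selection between these two sources can be sketched roughly as follows. This is an illustration only, not the code in this PR: the `HealthCheck` dataclass, the `RUNTIME_PROFILES` table and its entries, and the assumed shape of the parsed model-definition.yaml are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class HealthCheck:
    # Hypothetical stand-in for HealthCheckConfig; only a path is modeled here.
    path: str


# Hypothetical table: runtime variant -> predefined health check path, if any.
RUNTIME_PROFILES: dict[str, Optional[str]] = {
    "vllm": "/health",
    "custom": None,
}


def select_health_check(
    runtime_variant: str,
    model_definition: Optional[dict[str, Any]],
) -> Optional[HealthCheck]:
    """Prefer the runtime variant profile, then fall back to the model definition."""
    profile_path = RUNTIME_PROFILES.get(runtime_variant)
    if profile_path:
        return HealthCheck(path=profile_path)
    # Assumed layout: {"models": [{"service": {"health_check": {"path": ...}}}]}
    for model in (model_definition or {}).get("models", []):
        health_check = (model.get("service") or {}).get("health_check")
        if health_check and "path" in health_check:
            return HealthCheck(path=health_check["path"])
    return None
```

Returning `None` when neither source provides a path models the case where no health check is configured.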

Health check configuration is passed to AppProxy during endpoint creation, allowing AppProxy to handle health checking independently. Route connection information is stored in Redis for AppProxy consumption with the key pattern:
endpoint.{endpoint_id}.route_connection_info
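
As a rough sketch of how a value under that key pattern might be prepared: the key format comes from the description above and the `app` field is named in the review, but the other `RouteConnectionInfo` fields, the helper names, and the JSON layout are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RouteConnectionInfo:
    # `app` appears in the review comments; the remaining fields are hypothetical.
    app: str
    kernel_host: str
    kernel_port: int


def route_connection_info_key(endpoint_id: str) -> str:
    """Build the Redis key following the pattern given in the PR description."""
    return f"endpoint.{endpoint_id}.route_connection_info"


def serialize_routes(routes: dict[str, RouteConnectionInfo]) -> str:
    """Turn a route-id -> connection-info mapping into a JSON string for storage."""
    payload = {route_id: asdict(info) for route_id, info in routes.items()}
    return json.dumps(payload, sort_keys=True)
```

A caller would then write `serialize_routes(...)` to Redis under `route_connection_info_key(endpoint_id)`. The actual storage format used by the PR may differ.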

Testing

Existing tests should continue to pass. New functionality requires model service integration testing with AppProxy components.

Copilot

Pull Request Overview

This PR refactors model service health checks by offloading them to AppProxy, updates route management via Redis, and replaces the old predicate checker with a unified helper.

Introduce health check configuration reading and pass it to AppProxy
Refactor ModelServicePredicateChecker into ModelServiceHelper with consolidated validation methods
Persist route connection info in Redis (redis_live) and emit real-time update events

Reviewed Changes

Copilot reviewed 33 out of 33 changed files in this pull request and generated 2 comments.

File	Description
src/ai/backend/manager/services/processors.py	Add `redis_live` to `ServiceArgs`
src/ai/backend/manager/services/model_serving/types.py	Define `RouteConnectionInfo` dataclass
src/ai/backend/manager/services/model_serving/services/model_serving.py	Integrate `redis_helper`, health checks, notifications
src/ai/backend/manager/registry.py	Add `get_health_check_info`, `create_appproxy_endpoint`, `notify_endpoint_route_update_to_appproxy`
src/ai/backend/manager/models/endpoint.py	Implement `generate_redis_route_info`, rename helper
src/ai/backend/manager/models/routing.py	Update default `status_filter`, add `load_session`
src/ai/backend/manager/models/storage.py	Raise `VFolderGone` on 410 response
src/ai/backend/manager/models/vfolder.py	Catch `VFolderGone` in deletion
src/ai/backend/common/data/config/types.py	Add `HealthCheckConfig` Pydantic model

Comments suppressed due to low confidence (4)

src/ai/backend/manager/registry.py:3659

[nitpick] Consider adding a docstring to get_health_check_info to explain how it selects or reads health check settings for different runtime variants, including fallback behavior.

    async def get_health_check_info(

src/ai/backend/manager/services/model_serving/types.py:64

[nitpick] The field name app in RouteConnectionInfo is ambiguous. Consider renaming it to app_name or service_name for clarity.

class RouteConnectionInfo:

src/ai/backend/manager/registry.py:3692

Consider adding unit or integration tests for create_appproxy_endpoint and get_health_check_info to verify correct serialization of health check configs, Redis writes, and error handling.

    async def create_appproxy_endpoint(

src/ai/backend/manager/models/routing.py:125

Changing the default status_filter to include all active_route_statuses may change behavior for callers that relied on the previous default (HEALTHY, UNHEALTHY, PROVISIONING). Consider explicitly specifying the intended filter in existing call sites or updating tests.

        status_filter: list[RouteStatus] = list(RouteStatus.active_route_statuses()),

Copilot · 2025-07-15T09:39:54Z

src/ai/backend/manager/services/model_serving/services/model_serving.py

+                lambda r: r.set(
+                    f"endpoint.{action.service_id}.session.{route.session}.traffic_ratio",
+                    str(action.traffic_ratio),
+                    ex=3600,


The TTL value 3600 is hardcoded in multiple places. Extract this into a named constant or configuration parameter to improve maintainability.

Suggested change

- ex=3600,
+ ex=DEFAULT_TTL_SECONDS,
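
One way the suggestion could look in isolation is sketched below. `DEFAULT_TTL_SECONDS` is the name proposed in the review, while `traffic_ratio_entry` is a hypothetical helper added for illustration. Neither is confirmed to exist in the codebase.

```python
# Named constant replacing the repeated literal 3600 (name taken from the suggestion).
DEFAULT_TTL_SECONDS = 3600


def traffic_ratio_entry(
    service_id: str,
    session_id: str,
    traffic_ratio: float,
) -> tuple[str, str, dict[str, int]]:
    """Return the key, value, and keyword arguments for the Redis SET call."""
    key = f"endpoint.{service_id}.session.{session_id}.traffic_ratio"
    return key, str(traffic_ratio), {"ex": DEFAULT_TTL_SECONDS}
```

With a redis-py style client `r`, the result could be applied as `r.set(key, value, **kwargs)`, so every call site shares the same expiry.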

Copilot · 2025-07-15T09:39:55Z

src/ai/backend/manager/registry.py

+    ) -> HealthCheckConfig | None:
+        _info: HealthCheckConfig | None = None
+
+        if _path := MODEL_SERVICE_RUNTIME_PROFILES[endpoint.runtime_variant].health_check_endpoint:


Indexing MODEL_SERVICE_RUNTIME_PROFILES by endpoint.runtime_variant may throw a KeyError for variants not registered in the map. Add a guard or default fallback to handle unexpected variants.

Suggested change

- if _path := MODEL_SERVICE_RUNTIME_PROFILES[endpoint.runtime_variant].health_check_endpoint:
+ runtime_profile = MODEL_SERVICE_RUNTIME_PROFILES.get(endpoint.runtime_variant)
+ if runtime_profile and (_path := runtime_profile.health_check_endpoint):
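
The guarded lookup can be exercised on its own with a small stand-in for the profile map. In this sketch, `RuntimeProfile`, `PROFILES`, and `health_check_path` are hypothetical names. Only the `.get()` plus walrus pattern mirrors the suggestion.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RuntimeProfile:
    # Hypothetical stand-in for an entry of MODEL_SERVICE_RUNTIME_PROFILES.
    health_check_endpoint: Optional[str] = None


PROFILES: dict[str, RuntimeProfile] = {
    "known": RuntimeProfile("/health"),
    "no-check": RuntimeProfile(),
}


def health_check_path(runtime_variant: str) -> Optional[str]:
    """Return the profile's health check path, or None for unknown variants."""
    runtime_profile = PROFILES.get(runtime_variant)  # no KeyError on a missing key
    if runtime_profile and (_path := runtime_profile.health_check_endpoint):
        return _path
    return None
```

An unregistered variant and a registered variant without a health check endpoint both yield `None`, so the caller can treat them the same way.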

feat(BA-992): Offload health check capability to AppProxy

1734a4f

Copilot AI review requested due to automatic review settings July 15, 2025 09:38

github-actions bot assigned kyujin-cho Jul 15, 2025

kyujin-cho closed this Jul 15, 2025

kyujin-cho reopened this Jul 15, 2025

kyujin-cho changed the base branch from main to 25.11 July 15, 2025 09:39

Copilot AI reviewed Jul 15, 2025

HyeockJinKim added the skip:changelog label (Make the action workflow to skip towncrier check) Jul 15, 2025

HyeockJinKim approved these changes Jul 15, 2025

HyeockJinKim merged commit 1d7f5f5 into 25.11 Jul 15, 2025
12 of 14 checks passed

HyeockJinKim deleted the backport/health-check-offload branch July 15, 2025 09:41
