Conversation

kyujin-cho
Member

@kyujin-cho kyujin-cho commented Jul 14, 2025

Resolves #3051 (BA-992).

Summary

  • Refactor model service health check architecture to offload health checking to AppProxy
  • Update API specifications for endpoint route management and health check configuration
  • Implement separate model entity handling for health check configuration
  • Add support for reading health check information from model-definition.yaml

Breaking Changes

Important: This PR breaks the health check capability on the OSS AppProxy. Future work will restore support; until then, disable the health check feature in model-definition.yaml to use model services on open-source Backend.AI.

Changes

API Specifications

  • Updated model serving event types with new ModelServiceStatusEventArgs base class
  • Added EndpointRouteListUpdatedEvent for endpoint route synchronization
  • Refactored anycast/broadcast event structures for better separation of concerns

Health Check System

  • Added ModelServiceHelper class replacing ModelServicePredicateChecker
  • Implemented health check configuration reading from model-definition.yaml
  • Added health check configuration passing to AppProxy during endpoint creation
  • Separated model definition validation into discrete functions

Route Management

  • Updated endpoint route generation to use Redis for AppProxy communication
  • Modified session lifecycle handling for route status updates
  • Added generate_redis_route_info method for serializable connection data
  • Implemented notify_endpoint_route_update_to_appproxy for real-time updates

Database Changes

  • Enhanced endpoint model with Redis route info generation for AppProxy integration
  • Updated routing status management in session callbacks
  • Improved error handling and retry logic for route operations

Technical Details

The health check system now reads configuration from either:

  1. Runtime variant profiles for predefined endpoints
  2. model-definition.yaml for custom runtime variants
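
The lookup order above can be sketched as follows. This is a minimal illustration, not the code in registry.py: HealthCheckSettings is a plain dataclass stand-in for the Pydantic HealthCheckConfig, and the field names (path, interval, max_retries), the VARIANT_PROFILES table, the "custom" variant string, and the layout of the parsed model definition (models[*].service.health_check) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class HealthCheckSettings:
    # Stand-in for the PR's Pydantic HealthCheckConfig; field names are assumed.
    path: str
    interval: float = 10.0
    max_retries: int = 10


# Hypothetical table of predefined runtime variants and their health check settings.
VARIANT_PROFILES: Mapping[str, HealthCheckSettings] = {
    "vllm": HealthCheckSettings(path="/health"),
}


def resolve_health_check(
    runtime_variant: str,
    model_definition: Optional[Mapping[str, Any]] = None,
) -> Optional[HealthCheckSettings]:
    """Return the health check settings for an endpoint, or None if none are configured."""
    if runtime_variant != "custom":
        # 1. Predefined variants take their settings from the profile table.
        return VARIANT_PROFILES.get(runtime_variant)
    # 2. Custom variants read the already-parsed model-definition.yaml.
    for model in (model_definition or {}).get("models", []):
        raw = (model.get("service") or {}).get("health_check")
        if raw:
            return HealthCheckSettings(**raw)
    return None
```

In this sketch a return value of None stands for "no health check configured", which is the state the Breaking Changes note asks OSS users to select for now.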

Health check configuration is passed to AppProxy during endpoint creation, allowing AppProxy to handle health checking independently. Route connection information is stored in Redis for AppProxy consumption with the key pattern:
endpoint.{endpoint_id}.route_connection_info
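
As a rough sketch of what writing to that key might involve: the key format below is the one quoted above and the RouteConnectionInfo fields mirror the dataclass added in this PR, but the JSON payload shape (a mapping from route ID to connection info) and the helper names route_info_key and serialize_route_info are assumptions for illustration, not the output of generate_redis_route_info.

```python
import json
from dataclasses import asdict, dataclass
from uuid import UUID


@dataclass
class RouteConnectionInfo:
    # Field names follow the dataclass added in model_serving/types.py.
    app: str
    kernel_host: str
    kernel_port: int


def route_info_key(endpoint_id: UUID) -> str:
    # Key pattern quoted in the PR description.
    return f"endpoint.{endpoint_id}.route_connection_info"


def serialize_route_info(routes: dict[str, RouteConnectionInfo]) -> str:
    # Assumed payload shape: {route_id: {app, kernel_host, kernel_port}} as JSON.
    return json.dumps({rid: asdict(info) for rid, info in routes.items()}, sort_keys=True)
```

Keeping the value a plain JSON string means any consumer can decode it without sharing Python types with the manager.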

Testing

Existing tests should continue to pass. New functionality requires model service integration testing with AppProxy components.

@Copilot Copilot AI review requested due to automatic review settings July 14, 2025 12:04
@kyujin-cho kyujin-cho changed the title from "Offload health check capability to Redis with updated API specs" to "Offload health check capability to AppProxy with updated API specs" Jul 14, 2025
@kyujin-cho kyujin-cho changed the title from "Offload health check capability to AppProxy with updated API specs" to "feat(BA-992): Offload health check capability to AppProxy" Jul 14, 2025
Copilot

This comment was marked as outdated.

@github-actions github-actions bot added the labels size:XL (500~ LoC), comp:manager (Related to Manager component), comp:agent (Related to Agent component), and comp:common (Related to Common component) Jul 14, 2025
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR refactors how health checks and route updates are offloaded to AppProxy. Key changes include:

  • Introducing ModelServiceHelper and reading health check config from model-definition.yaml
  • Switching route updates to Redis-based notifications (valkey_live) instead of direct AppProxy HTTP calls
  • Adding Pydantic HealthCheckConfig and YAML-driven health check extraction

Reviewed Changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 2 comments.

File | Description
src/ai/backend/manager/services/processors.py | Injected ValkeyLiveClient into ServiceArgs
src/ai/backend/manager/services/model_serving/types.py | Added RouteConnectionInfo dataclass
src/ai/backend/manager/services/model_serving/services/model_serving.py | Refactored endpoint creation and route update flows to use ModelServiceHelper and Redis
src/ai/backend/manager/registry.py | Implemented AppProxy endpoint creation, health-check support, and Redis notifications
src/ai/backend/common/data/config/types.py | Added HealthCheckConfig Pydantic model
Comments suppressed due to low confidence (3)

src/ai/backend/manager/registry.py:3586

  • Add unit tests for get_health_check_info to cover both standard runtime variants and CUSTOM model-definition.yaml parsing to ensure health check configurations are correctly extracted.
    async def get_health_check_info(

src/ai/backend/manager/registry.py:3640

  • [nitpick] Add or update the docstring for create_appproxy_endpoint to explain parameters, return value, and health_check payload format for better maintainability.
            async with session.post(

src/ai/backend/manager/registry.py:3658

  • The create_appproxy_endpoint payload no longer includes any 'apps' or inference routing info; ensure AppProxy can derive routing solely from Redis notifications or re-add necessary fields to the payload if required.
                    "health_check": health_check_information.model_dump(mode="json")

Comment on lines +63 to +67
@dataclass
class RouteConnectionInfo:
    app: str
    kernel_host: str
    kernel_port: int

Copilot AI Jul 14, 2025

[nitpick] The RouteConnectionInfo dataclass is defined but never used; consider removing it or integrating it into generate_redis_route_info for consistent typing.

Suggested change
-@dataclass
-class RouteConnectionInfo:
-    app: str
-    kernel_host: str
-    kernel_port: int
+# Removed the unused RouteConnectionInfo dataclass as it is not referenced anywhere in the code.


@@ -3671,6 +3688,19 @@ async def delete_appproxy_endpoint(self, db_sess: AsyncSession, endpoint: Endpoi
        ):
            pass

    async def notify_endpoint_route_update_to_appproxy(self, endpoint: EndpointRow) -> None:

Copilot AI Jul 14, 2025

Consider wrapping Redis operations in notify_endpoint_route_update_to_appproxy within try/except to handle potential valkey_live errors and avoid unhandled exceptions disrupting routing updates.
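
The suggestion could look roughly like the following. This is a sketch only: LiveStore is a hypothetical protocol standing in for the valkey_live client (its real method names are not shown in this conversation), InMemoryStore exists purely for the demo, and notify_route_update takes an endpoint ID and a pre-serialized payload rather than an EndpointRow.

```python
import asyncio
import logging
from typing import Protocol

log = logging.getLogger(__name__)


class LiveStore(Protocol):
    # Hypothetical interface; the real valkey_live client API may differ.
    async def set(self, key: str, value: str) -> None: ...


class InMemoryStore:
    # Demo-only stand-in that satisfies LiveStore.
    def __init__(self) -> None:
        self.data: dict[str, str] = {}

    async def set(self, key: str, value: str) -> None:
        self.data[key] = value


async def notify_route_update(store: LiveStore, endpoint_id: str, payload: str) -> bool:
    """Publish route info; return False instead of raising on store errors."""
    key = f"endpoint.{endpoint_id}.route_connection_info"
    try:
        await store.set(key, payload)
    except Exception:
        # Log and continue so one failed write does not abort the caller's routing update.
        log.exception("failed to publish route info for endpoint %s", endpoint_id)
        return False
    return True


if __name__ == "__main__":
    print(asyncio.run(notify_route_update(InMemoryStore(), "demo", "{}")))
```

Returning a boolean leaves the retry decision to the caller. Catching Exception does not swallow task cancellation, because asyncio.CancelledError derives from BaseException on Python 3.8 and later.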


    session_id: SessionId
    model_name: str
    new_status: ModelServiceStatus
class RouteCreationEvent(AbstractAnycastEvent):
Collaborator

RouteLifecycleEvent?

@HyeockJinKim HyeockJinKim changed the base branch from main to 25.11 July 15, 2025 04:21
@HyeockJinKim HyeockJinKim changed the base branch from 25.11 to main July 15, 2025 07:24
@HyeockJinKim
Collaborator

Please fix this conflict tomorrow. @kyujin-cho

@HyeockJinKim
Collaborator

Please resolve conflicts.

@HyeockJinKim HyeockJinKim force-pushed the feature/health-check-offload branch from ec25b1b to 0258e66 Compare July 19, 2025 08:49
@HyeockJinKim HyeockJinKim enabled auto-merge July 19, 2025 08:50
@HyeockJinKim HyeockJinKim force-pushed the feature/health-check-offload branch from 0258e66 to 192e55a Compare July 19, 2025 08:55
@HyeockJinKim HyeockJinKim added this pull request to the merge queue Jul 19, 2025
Merged via the queue into main with commit 90413a6 Jul 19, 2025
29 checks passed
@HyeockJinKim HyeockJinKim deleted the feature/health-check-offload branch July 19, 2025 09:09
seedspirit pushed a commit that referenced this pull request Jul 23, 2025
seedspirit pushed a commit that referenced this pull request Jul 24, 2025
seedspirit pushed a commit that referenced this pull request Jul 24, 2025
Development

Successfully merging this pull request may close these issues.

Advanced High-Availability considerations for model service
2 participants