Skip to content

Conversation

kyujin-cho
Copy link
Member

Resolves #3051 (BA-992).

Summary

  • Refactor model service health check architecture to offload health checking to AppProxy
  • Update API specifications for endpoint route management and health check configuration
  • Implement separate model entity handling for health check configuration
  • Add support for reading health check information from model-definition.yaml

Breaking Changes

Important: This PR breaks health check capability on OSS AppProxy. Future work will restore support, but for now disable the health check feature in model-definition.yaml to use model service on Open Source Backend.AI.

Changes

API Specifications

  • Updated model serving event types with new ModelServiceStatusEventArgs base class
  • Added EndpointRouteListUpdatedEvent for endpoint route synchronization
  • Refactored anycast/broadcast event structures for better separation of concerns

Health Check System

  • Added ModelServiceHelper class replacing ModelServicePredicateChecker
  • Implemented health check configuration reading from model-definition.yaml
  • Added health check configuration passing to AppProxy during endpoint creation
  • Separated model definition validation into discrete functions

Route Management

  • Updated endpoint route generation to use Redis for AppProxy communication
  • Modified session lifecycle handling for route status updates
  • Added generate_redis_route_info method for serializable connection data
  • Implemented notify_endpoint_route_update_to_appproxy for real-time updates

Database Changes

  • Enhanced endpoint model with Redis route info generation for AppProxy integration
  • Updated routing status management in session callbacks
  • Improved error handling and retry logic for route operations

Technical Details

The health check system now reads configuration from either:

  1. Runtime variant profiles for predefined endpoints
  2. model-definition.yaml for custom runtime variants

Health check configuration is passed to AppProxy during endpoint creation, allowing AppProxy to handle health checking independently. Route connection information is stored in Redis for AppProxy consumption with the key pattern:
endpoint.{endpoint_id}.route_connection_info

Testing

Existing tests should continue to pass. New functionality requires model service integration testing with AppProxy components.

@Copilot Copilot AI review requested due to automatic review settings July 15, 2025 09:38
@kyujin-cho kyujin-cho closed this Jul 15, 2025
@kyujin-cho kyujin-cho reopened this Jul 15, 2025
@kyujin-cho kyujin-cho changed the base branch from main to 25.11 July 15, 2025 09:39
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors model service health checks by offloading them to AppProxy, updates route management via Redis, and replaces the old predicate checker with a unified helper.

  • Introduce health check configuration reading and pass it to AppProxy
  • Refactor ModelServicePredicateChecker into ModelServiceHelper with consolidated validation methods
  • Persist route connection info in Redis (redis_live) and emit real-time update events

Reviewed Changes

Copilot reviewed 33 out of 33 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/ai/backend/manager/services/processors.py Add redis_live to ServiceArgs
src/ai/backend/manager/services/model_serving/types.py Define RouteConnectionInfo dataclass
src/ai/backend/manager/services/model_serving/services/model_serving.py Integrate redis_helper, health checks, notifications
src/ai/backend/manager/registry.py Add get_health_check_info, create_appproxy_endpoint, notify_endpoint_route_update_to_appproxy
src/ai/backend/manager/models/endpoint.py Implement generate_redis_route_info, rename helper
src/ai/backend/manager/models/routing.py Update default status_filter, add load_session
src/ai/backend/manager/models/storage.py Raise VFolderGone on 410 response
src/ai/backend/manager/models/vfolder.py Catch VFolderGone in deletion
src/ai/backend/common/data/config/types.py Add HealthCheckConfig Pydantic model
Comments suppressed due to low confidence (4)

src/ai/backend/manager/registry.py:3659

  • [nitpick] Consider adding a docstring to get_health_check_info to explain how it selects or reads health check settings for different runtime variants, including fallback behavior.
    async def get_health_check_info(

src/ai/backend/manager/services/model_serving/types.py:64

  • [nitpick] The field name app in RouteConnectionInfo is ambiguous. Consider renaming it to app_name or service_name for clarity.
class RouteConnectionInfo:

src/ai/backend/manager/registry.py:3692

  • Consider adding unit or integration tests for create_appproxy_endpoint and get_health_check_info to verify correct serialization of health check configs, Redis writes, and error handling.
    async def create_appproxy_endpoint(

src/ai/backend/manager/models/routing.py:125

  • Changing the default status_filter to include all active_route_statuses may change behavior for callers that relied on the previous default (HEALTHY, UNHEALTHY, PROVISIONING). Consider explicitly specifying the intended filter in existing call sites or updating tests.
        status_filter: list[RouteStatus] = list(RouteStatus.active_route_statuses()),

lambda r: r.set(
f"endpoint.{action.service_id}.session.{route.session}.traffic_ratio",
str(action.traffic_ratio),
ex=3600,
Copy link
Preview

Copilot AI Jul 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TTL value 3600 is hardcoded in multiple places. Extract this into a named constant or configuration parameter to improve maintainability.

Suggested change
ex=3600,
ex=DEFAULT_TTL_SECONDS,

Copilot uses AI. Check for mistakes.

) -> HealthCheckConfig | None:
_info: HealthCheckConfig | None = None

if _path := MODEL_SERVICE_RUNTIME_PROFILES[endpoint.runtime_variant].health_check_endpoint:
Copy link
Preview

Copilot AI Jul 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indexing MODEL_SERVICE_RUNTIME_PROFILES by endpoint.runtime_variant may throw a KeyError for variants not registered in the map. Add a guard or default fallback to handle unexpected variants.

Suggested change
if _path := MODEL_SERVICE_RUNTIME_PROFILES[endpoint.runtime_variant].health_check_endpoint:
runtime_profile = MODEL_SERVICE_RUNTIME_PROFILES.get(endpoint.runtime_variant)
if runtime_profile and (_path := runtime_profile.health_check_endpoint):

Copilot uses AI. Check for mistakes.

@HyeockJinKim HyeockJinKim added the skip:changelog Make the action workflow to skip towncrier check label Jul 15, 2025
@HyeockJinKim HyeockJinKim merged commit 1d7f5f5 into 25.11 Jul 15, 2025
12 of 14 checks passed
@HyeockJinKim HyeockJinKim deleted the backport/health-check-offload branch July 15, 2025 09:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
skip:changelog Make the action workflow to skip towncrier check
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Advanced High-Availability considerations for model service
2 participants