Currently, Model Service relies on the health check information provided by the kernel runner operating in each container. Because the container itself is the only source of this information, the health status cannot be determined when an entire GPU node shuts down. To guarantee the liveness of each model service, it is crucial to detect when a container itself is unresponsive and, if so, reconcile the replica count. We suggest the following improvements to resolve the issue:
Make AppProxy the health checker
Add an option to automatically terminate unhealthy sessions after a certain grace period
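The two suggestions above could fit together roughly as follows. This is only a minimal sketch under stated assumptions, not the actual Backend.AI or AppProxy implementation: the class `HealthTracker`, its methods `report` and `expired`, and the `grace_period` parameter are all hypothetical names introduced for illustration. It assumes an external checker (such as AppProxy) probes each session and reports the result, and that a probe timeout is reported as unhealthy, so a node that has shut down is still detected.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class HealthTracker:
    """Hypothetical helper: tracks how long each session has been unhealthy."""

    grace_period: float  # seconds a session may stay unhealthy before termination
    clock: Callable[[], float] = time.monotonic  # injectable for testing
    _unhealthy_since: Dict[str, float] = field(default_factory=dict)

    def report(self, session_id: str, healthy: bool) -> None:
        """Record one probe result from the external checker.

        A timed-out probe should be reported as healthy=False, which covers
        the case where the whole node (and its kernel runner) is gone.
        """
        if healthy:
            # Recovery resets the grace period.
            self._unhealthy_since.pop(session_id, None)
        else:
            # Keep the timestamp of the first failure in the current streak.
            self._unhealthy_since.setdefault(session_id, self.clock())

    def expired(self) -> List[str]:
        """Return sessions that have been unhealthy for at least grace_period."""
        now = self.clock()
        return sorted(
            sid
            for sid, since in self._unhealthy_since.items()
            if now - since >= self.grace_period
        )
```

A caller would invoke `report()` after every probe and periodically terminate whatever `expired()` returns, letting the existing replica reconciliation start replacements. Making the grace period optional (for example, disabling termination when it is unset) would be a natural extension, but that policy is left out of this sketch.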