Commit b8fcc8b

fix(vertexai): prevent RuntimeError from stale client after startup event loop (backport #6072) (#6105)
Ran into #6057 while setting up a VertexAI provider: the server would crash with `RuntimeError: Event loop is closed` on the first inference request after startup.

Turns out the issue is that during `StackApp.__init__`, the stack initialization runs in a temporary event loop, and `refresh_registry_once()` triggers model listing which calls `_get_client()` on the VertexAI adapter. The Google genai `Client` eagerly creates an `httpx.AsyncClient` internal to itself, binding it to that temporary loop. After the temp loop goes away and uvicorn starts on a fresh loop, the cached client is still holding connections tied to the dead loop.

Two things in this PR:

1. Added `_reset_client()` on `VertexAIInferenceAdapter`, which clears the cached default client and HTTP options. This is called from `StackApp.__init__` right after `reset_sqlstore_engines()`, following the exact same pattern that already exists for SQL engines.
2. Added a safety check in `_get_client()` itself: before returning the cached default client, it checks whether the underlying httpx transport has been closed. If it has (which happens when the event loop it was created on is terminated), it logs and recreates the client. This is defense-in-depth in case the reset isn't called.

Not entirely sure about the `is_closed` check: it relies on httpx's internal state tracking, which seems stable across recent versions but could change. Happy to remove that part if you'd prefer to keep it simpler.

## Test Plan

Ran `python3.12 -m py_compile` on both modified files; they compile cleanly. The existing test suite should cover the normal code paths since these changes only affect the initialization/recreation path. The event loop simulation is tricky to unit test without bringing up a full server, but the pattern mirrors the tested `reset_sqlstore_engines()` flow exactly.

<hr>

This is an automatic backport of pull request #6072 done by [Mergify](https://mergify.com).
Signed-off-by: goingforstudying-ctrl <goingforstudying-ctrl@users.noreply.github.com>
Co-authored-by: goingforstudying-ctrl <goingforstudying@gmail.com>
Co-authored-by: goingforstudying-ctrl <goingforstudying-ctrl@users.noreply.github.com>
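
To make the failure mode described in the commit message concrete, here is a minimal stdlib-only sketch. `LoopBoundClient`, `Adapter`, `run_in_temporary_loop`, and `demo` are hypothetical stand-ins invented for illustration; they are not the real genai `Client` or `VertexAIInferenceAdapter`. The sketch assumes only that the cached client captures the event loop that was running when it was created.

```python
import asyncio


class LoopBoundClient:
    """Hypothetical client that captures the running loop at creation time."""

    def __init__(self) -> None:
        self._loop = asyncio.get_running_loop()

    async def request(self) -> str:
        fut = self._loop.create_future()
        # call_soon() raises RuntimeError if the captured loop is closed.
        self._loop.call_soon(fut.set_result, "ok")
        return await fut


class Adapter:
    """Hypothetical adapter that lazily creates and caches one client."""

    def __init__(self) -> None:
        self._default_client = None

    def _get_client(self) -> LoopBoundClient:
        if self._default_client is None:
            self._default_client = LoopBoundClient()
        return self._default_client

    def _reset_client(self) -> None:
        # Drop the cached client without awaiting any async close.
        self._default_client = None


async def use_client(adapter: Adapter) -> str:
    return await adapter._get_client().request()


def run_in_temporary_loop(coro):
    """Run a coroutine on a throwaway loop, then close that loop."""
    loop = asyncio.new_event_loop()
    try:
        return loop.run_until_complete(coro)
    finally:
        loop.close()


def demo(reset: bool) -> str:
    adapter = Adapter()
    run_in_temporary_loop(use_client(adapter))  # startup-time model listing
    if reset:
        adapter._reset_client()
    try:
        return asyncio.run(use_client(adapter))  # first request on a fresh loop
    except RuntimeError as exc:
        return f"RuntimeError: {exc}"


if __name__ == "__main__":
    print(demo(reset=False))  # RuntimeError: Event loop is closed
    print(demo(reset=True))   # ok
```

Without the reset, the second loop reuses a client whose captured loop is already closed. With the reset, the client is rebuilt on the loop that actually serves the request.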
1 parent 557fed3 commit b8fcc8b

2 files changed

Lines changed: 58 additions & 1 deletion

src/ogx/core/server/server.py

Lines changed: 11 additions & 0 deletions
@@ -147,6 +147,17 @@ def __init__(self, config: StackConfig, *args: Any, **kwargs: Any) -> None:
 
         reset_sqlstore_engines()
 
+        # Reset VertexAI provider clients that may have been created in the
+        # temporary event loop during model listing (refresh_registry_once).
+        # Like SQL engines, the Google genai Client eagerly binds an internal
+        # httpx.AsyncClient to the current event loop, and the cached client
+        # becomes unusable after the temporary loop is terminated.
+        if self.stack.impls:
+            for impl in self.stack.impls.values():
+                reset_fn = getattr(impl, "_reset_client", None)
+                if reset_fn is not None:
+                    reset_fn()
+
 
 @asynccontextmanager
 async def lifespan(app: StackApp) -> AsyncIterator[None]:
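
The hunk above relies on a duck-typed hook: any provider implementation that defines `_reset_client` gets reset, and everything else is skipped. A minimal standalone sketch of that pattern follows. `ImplWithoutHook`, `ImplWithCachedClient`, and `reset_cached_clients` are hypothetical names, and the `callable()` test is a slightly stricter variation on the `is not None` check in the diff.

```python
from typing import Any, Dict, List, Optional


class ImplWithoutHook:
    """Hypothetical provider that caches nothing loop-bound."""


class ImplWithCachedClient:
    """Hypothetical provider that exposes the optional reset hook."""

    def __init__(self) -> None:
        self.client: Optional[object] = object()

    def _reset_client(self) -> None:
        self.client = None


def reset_cached_clients(impls: Optional[Dict[str, Any]]) -> List[str]:
    """Call _reset_client() on every impl that defines it.

    Returns the names of the impls that were reset, in iteration order.
    """
    reset_names: List[str] = []
    for name, impl in (impls or {}).items():
        reset_fn = getattr(impl, "_reset_client", None)
        if callable(reset_fn):
            reset_fn()
            reset_names.append(name)
    return reset_names
```

Looking the hook up with `getattr` keeps the server code free of provider-specific imports, at the cost of relying on a private method name as an informal contract.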

src/ogx/providers/remote/inference/vertexai/vertexai.py

Lines changed: 47 additions & 1 deletion
@@ -202,6 +202,29 @@ async def initialize(self) -> None:
                 exc_info=True,
             )
 
+    def _reset_client(self) -> None:
+        """Reset cached client and HTTP options after a temporary event loop exits.
+
+        When StackApp.__init__ runs stack.initialize() inside a temporary event
+        loop (via ThreadPoolExecutor), model listing may trigger lazy client
+        creation via _get_client(). The Google genai Client eagerly creates an
+        internal httpx.AsyncClient bound to the temporary loop. After the
+        temporary loop is closed, the cached client holds connections tied to
+        the dead loop, causing ``RuntimeError: Event loop is closed`` on the
+        first inference request.
+
+        This method clears the cached client without awaiting async close
+        (the temporary loop is already terminated) so that a fresh client is
+        created on the next _get_client() call — this time on uvicorn's
+        request-handling event loop.
+
+        Compare ``reset_sqlstore_engines()`` which serves the same purpose for
+        SQL engines.
+        """
+        self._default_client = None
+        self._http_options = None
+        self._http_options_initialized = False
+
     async def shutdown(self) -> None:
         await self._close_managed_httpx_client()
         self._http_options = None
@@ -315,7 +338,30 @@ def _get_client(self) -> Client:
             access_token = self.config.auth_credential.get_secret_value() if self.config.auth_credential else None
             return self._create_client(project=project, location=location, access_token=access_token)
 
-        # Lazily create the default client on first use
+        # Lazily create the default client on first use.
+        # If we already have a cached client, verify it is still usable before
+        # returning it — a previous request may have left connections tied to an
+        # event loop that is now closed (e.g., after a temporary startup loop).
+        if self._default_client is not None:
+            try:
+                # Touch the underlying httpx client to detect event loop binding
+                # issues. If the client was created in a now-closed loop,
+                # accessing its transport raises RuntimeError.
+                if self._http_options is not None:
+                    _client = getattr(self._http_options, "httpx_async_client", None)
+                    if _client is not None and _client.is_closed:
+                        logger.info(
+                            "VertexAI default client transport is closed; recreating",
+                            project=self.config.project,
+                        )
+                        self._default_client = None
+            except RuntimeError:
+                logger.warning(
+                    "VertexAI default client is bound to a closed event loop; recreating",
+                    project=self.config.project,
+                )
+                self._default_client = None
+
         if self._default_client is None:
             access_token = self.config.auth_credential.get_secret_value() if self.config.auth_credential else None
             try:
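
The guard in this hunk can be sketched in isolation as "check the cached transport, drop the cache if it reports closed, then fall through to lazy creation". In the sketch below, `FakeTransport`, `FakeHttpOptions`, and `CachingAdapter` are hypothetical stand-ins; only the `is_closed` attribute name and the `httpx_async_client` field name are taken from the diff, and the real client construction is replaced by a plain `object()`.

```python
from typing import Optional


class FakeTransport:
    """Hypothetical stand-in exposing an is_closed flag."""

    def __init__(self) -> None:
        self.is_closed = False


class FakeHttpOptions:
    """Hypothetical stand-in for the HTTP options holding the transport."""

    def __init__(self, httpx_async_client: FakeTransport) -> None:
        self.httpx_async_client = httpx_async_client


class CachingAdapter:
    """Hypothetical adapter that revalidates its cached client before reuse."""

    def __init__(self) -> None:
        self._default_client: Optional[object] = None
        self._http_options: Optional[FakeHttpOptions] = None
        self.created = 0  # counts how many clients were built

    def _create_client(self) -> object:
        self.created += 1
        self._http_options = FakeHttpOptions(FakeTransport())
        return object()

    def _get_client(self) -> object:
        # Drop the cached client if its transport reports that it is closed.
        if self._default_client is not None and self._http_options is not None:
            transport = getattr(self._http_options, "httpx_async_client", None)
            if transport is not None and transport.is_closed:
                self._default_client = None
        # Lazily (re)create the client on first use or after invalidation.
        if self._default_client is None:
            self._default_client = self._create_client()
        return self._default_client
```

The check only helps if the transport actually reports itself as closed, which is why the explicit `_reset_client()` call at startup remains the primary fix and this guard is a fallback.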
