Your current environment
Docker image:
vllm/vllm-openai:nightly
https://hub.docker.com/layers/vllm/vllm-openai/nightly/images/sha256-2b5f940431016b25c461761cb813cebd1f02a9e4ba1069226a5c1c9ffb6834c6
vLLM version:
0.21.1rc1.dev262+g33d7cbe02
Model:
RedHatAI/gemma-4-31B-it-NVFP4
Related issue:
#43480
🐛 Describe the bug
I previously reported a similar startup failure in #43480, where the nightly Docker image failed because pytest was not installed and was imported indirectly via humming / cupy.testing.
After pulling a newer nightly image, the original failure path seems to have changed, but the server still fails to start because pytest is missing.
In this newer build, the model is loaded successfully, but EngineCore fails during startup while vLLM is initializing KV caches and running the profiling dummy run.
The failure path is now roughly:
EngineCore startup
-> _initialize_kv_caches
-> determine_available_memory
-> gpu_worker.profile_run
-> gpu_model_runner._dummy_run
-> torch._dynamo AOT compile
-> torch.distributed.tensor.experimental._context_parallel._cp_custom_ops
-> torch.library.custom_op / _register_fake
-> inspect.getframeinfo / inspect.getmodule
-> cupy.testing
-> import pytest
-> ModuleNotFoundError: No module named 'pytest'
So this appears to be the same underlying runtime dependency / import side-effect issue as #43480, but it is now triggered from a different code path during EngineCore initialization rather than during the earlier quantization config verification path.
Since pytest is normally a test dependency, the official runtime Docker image should not require it for normal vLLM server startup.
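A quick check along the following lines, run with the container's Python interpreter, might help confirm which of the modules involved are actually installed in the image. This is only a minimal sketch: the helper names `module_available` and `report` are hypothetical, and note that looking up `cupy.testing` imports the parent `cupy` package as a side effect.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if `name` can be located without importing it.

    For a dotted name such as "cupy.testing", find_spec imports the
    parent package ("cupy") first, so a broken parent counts as unavailable.
    """
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # ImportError: the parent package is missing or fails to import.
        # ValueError: the module is already loaded but has no usable spec.
        return False


def report(names=("pytest", "cupy", "cupy.testing")):
    """Map each module name to whether it can be located."""
    return {name: module_available(name) for name in names}


if __name__ == "__main__":
    for name, ok in report().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

If `cupy.testing` is reported as found while `pytest` is reported as missing, that would be consistent with the failure described here.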
Startup arguments
The server was started with the following non-default arguments shown in the log:
{
'model_tag': 'RedHatAI/gemma-4-31B-it-NVFP4',
'default_chat_template_kwargs': {'enable_thinking': True},
'enable_auto_tool_choice': True,
'tool_call_parser': 'gemma4',
'host': '0.0.0.0',
'model': 'RedHatAI/gemma-4-31B-it-NVFP4',
'trust_remote_code': True,
'max_model_len': 256000,
'served_model_name': ['gemma4-31b'],
'reasoning_parser': 'gemma4',
'kv_cache_dtype': 'fp8',
'mm_processor_kwargs': {'max_soft_tokens': 1120},
'max_num_batched_tokens': 8192,
'max_num_seqs': 32,
'scheduler_reserve_full_isl': False,
'async_scheduling': True,
'optimization_level': '3'
}
Expected behavior
vllm/vllm-openai:nightly should start successfully without requiring the test dependency pytest.
If some runtime dependency indirectly imports cupy.testing, the Docker image should either include the required dependency or avoid importing test-only modules during normal server startup.
Actual behavior
The server fails during EngineCore initialization.
The important part of the traceback is:
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 396, in determine_available_memory
self.model_runner.profile_run()
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 6164, in profile_run
hidden_states, last_hidden_states = self._dummy_run(
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5824, in _dummy_run
outputs = self.model(
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4_mm.py", line 1487, in forward
hidden_states = self.language_model.model(
File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 663, in __call__
self.aot_compiled_fn = self.aot_compile(*args, **kwargs)
File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 873, in aot_compile
return aot_compile_fullgraph(
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/tensor/experimental/_context_parallel/_cp_custom_ops.py", line 8, in <module>
@torch.library.custom_op("cplib::flex_cp_allgather", mutates_args=())
File "/usr/local/lib/python3.12/dist-packages/torch/_library/utils.py", line 45, in get_source
frame = inspect.getframeinfo(sys._getframe(stacklevel))
File "/usr/lib/python3.12/inspect.py", line 1007, in getmodule
if ismodule(module) and hasattr(module, '__file__'):
File "/usr/local/lib/python3.12/dist-packages/cupy/testing/__init__.py", line 50, in <module>
from cupy.testing._random import fix_random # NOQA
File "/usr/local/lib/python3.12/dist-packages/cupy/testing/_random.py", line 11, in <module>
import pytest
ModuleNotFoundError: No module named 'pytest'
Then the API server exits with:
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
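One plausible reading of the `inspect.getmodule` frame above is that the `hasattr(module, '__file__')` check, which runs over the entries of `sys.modules`, touches a lazily loaded `cupy.testing` and thereby triggers its real import. The toy sketch below illustrates that mechanism only. `LazyStubModule`, `scan_modules_for_files`, and the module names are hypothetical stand-ins, not code from CuPy, PyTorch, or vLLM, and whether `cupy.testing` is actually registered lazily is an assumption based on the traceback.

```python
import sys
import types


class LazyStubModule(types.ModuleType):
    """Hypothetical stand-in for a lazily loaded package."""

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, e.g. for __file__.
        # A real lazy module would run its deferred import here; this stub
        # simulates that import hitting a missing test dependency.
        raise ModuleNotFoundError("No module named 'hypothetical_test_dep'")


def scan_modules_for_files():
    """Mimic a sys.modules scan that collects modules exposing __file__."""
    found = []
    for name, module in sys.modules.copy().items():
        # hasattr only swallows AttributeError, so any other exception
        # raised during the attribute lookup propagates to the caller.
        if isinstance(module, types.ModuleType) and hasattr(module, "__file__"):
            found.append(name)
    return found


def demo():
    """Register the stub, run the scan, and return the resulting error text."""
    stub_name = "hypothetical_lazy_pkg"
    sys.modules[stub_name] = LazyStubModule(stub_name)
    try:
        scan_modules_for_files()
    except ModuleNotFoundError as exc:
        return str(exc)
    finally:
        # Always remove the stub so the interpreter state is left clean.
        del sys.modules[stub_name]
    return None


if __name__ == "__main__":
    print(demo())
```

If this reading is right, any code that introspects `sys.modules` could hit the same error, which would explain why the failure moved to a different code path between nightly builds.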
Notes
This does not look like a model download or Hugging Face authentication issue. The model checkpoint is loaded successfully before the failure.
This also does not look specific to the earlier humming import path reported in #43480. The new failure path goes through torch._dynamo / torch.distributed.tensor.experimental / cupy.testing, but it reaches the same root cause:
ModuleNotFoundError: No module named 'pytest'
Could you please check whether the nightly runtime image should include pytest, or whether cupy.testing should be avoided during normal vLLM server startup?