1. Quick Debug Information
- OS/Version (e.g. RHEL8.6, Ubuntu22.04): Ubuntu22.04
- Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): Containerd
- K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): K8s Server Version: v1.28.15
- NIM Operator Version: 1.0.1
root@bcm10-headnode:~/nim-operator-workspace# helm list -A
NAME                      NAMESPACE          REVISION  UPDATED                                   STATUS    CHART                          APP VERSION
gpu-operator-1734685148   gpu-operator       1         2024-12-20 16:59:09.420816993 +0800 CST   deployed  gpu-operator-v24.9.1           v24.9.1
k8s-nim-operator          nim-operator       1         2025-01-10 10:01:29.551913422 +0800 CST   deployed  k8s-nim-operator-1.0.1         1.0.1
local-path-provisioner    cm                 1         2024-12-20 16:02:55.986116685 +0800 CST   deployed  local-path-provisioner-0.0.30  v0.0.30
network-operator          network-operator   1         2024-12-20 16:05:57.780143109 +0800 CST   deployed  network-operator-24.7.0        v24.7.0
2. Issue or feature description
I am trying to set NIM_MAX_MODEL_LEN to reduce GPU memory usage, but it does not seem to take effect. Also, on the configuration page at https://docs.nvidia.com/nim/large-language-models/latest/configuration.html I cannot find gpu_memory_utilization or enforce_eager, so I don't know how to set those two parameters.
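For reference, the way I am trying to pass it is through the env list in the NIMService spec (the nv-llama3-8b-instruct-nim-service-nimcache.yaml used below). This is only a minimal sketch, not my exact manifest; the image tag, cache name, secrets, and the NIM_MAX_MODEL_LEN value are placeholders:

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: nv-llama3-8b-instruct
  namespace: nim-service
spec:
  image:
    repository: nvcr.io/nim/meta/llama3-8b-instruct
    tag: 1.0.3
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: meta-llama3-8b-instruct
      profile: ''
  # environment variables passed to the NIM container; NIM_MAX_MODEL_LEN is the
  # one I expect to cap max_seq_len and shrink the KV cache
  env:
    - name: NIM_MAX_MODEL_LEN
      value: "4096"
  replicas: 1
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000

In the pod logs below, max_seq_len still comes out as 8192 and enforce_eager stays False.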
root@bcm10-headnode:~/nim-operator-workspace# kubectl create -f nv-llama3-8b-instruct-nim-service-nimcache.yaml
nimservice.apps.nvidia.com/nv-llama3-8b-instruct created
root@bcm10-headnode:~/nim-operator-workspace# kubectl get nimservices.apps.nvidia.com -n nim-service
NAME STATUS AGE
nv-llama3-8b-instruct NotReady 2s
root@bcm10-headnode:~/nim-operator-workspace# kubectl get nimservices.apps.nvidia.com -n nim-service
NAME STATUS AGE
nv-llama3-8b-instruct NotReady 12s
root@bcm10-headnode:~/nim-operator-workspace# kubectl get pod -n nim-service
NAME READY STATUS RESTARTS AGE
nv-llama3-8b-instruct-65bcb494c5-rhfll 0/1 Running 0 19s
root@bcm10-headnode:~/nim-operator-workspace# kubectl logs -f -n nim-service nv-llama3-8b-instruct-65bcb494c5-rhfll
===========================================
== NVIDIA Inference Microservice LLM NIM ==
===========================================
NVIDIA Inference Microservice LLM NIM Version 1.0.3
Model: meta/llama3-8b-instruct
Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This NIM container is governed by the NVIDIA AI Product Agreement here:
https://www.nvidia.com/en-us/data-center/products/nvidia-ai-enterprise/eula/.
A copy of this license can be found under /opt/nim/LICENSE.
The use of this model is governed by the AI Foundation Models Community License
here: https://docs.nvidia.com/ai-foundation-models-community-license.pdf.
ADDITIONAL INFORMATION: Meta Llama 3 Community License, Built with Meta Llama 3.
A copy of the Llama 3 license can be found under /opt/nim/MODEL_LICENSE.
2025-01-20 06:55:58,149 [INFO] PyTorch version 2.2.2 available.
2025-01-20 06:55:59,070 [WARNING] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: error
2025-01-20 06:55:59,070 [INFO] [TRT-LLM] [I] Starting TensorRT-LLM init.
[TensorRT-LLM][INFO] Set logger level by INFO
2025-01-20 06:55:59,300 [INFO] [TRT-LLM] [I] TensorRT-LLM inited.
[TensorRT-LLM] TensorRT-LLM version: 0.10.1.dev2024053000
INFO 01-20 06:55:59.944 api_server.py:489] NIM LLM API version 1.0.0
INFO 01-20 06:55:59.945 ngc_profile.py:218] Running NIM without LoRA. Only looking for compatible profiles that do not support LoRA.
INFO 01-20 06:55:59.945 ngc_profile.py:220] Detected 1 compatible profile(s).
INFO 01-20 06:55:59.945 ngc_injector.py:107] Valid profile: 8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d (vllm-fp16-tp1) on GPUs [0]
INFO 01-20 06:55:59.946 ngc_injector.py:142] Selected profile: 8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d (vllm-fp16-tp1)
INFO 01-20 06:55:59.946 ngc_injector.py:147] Profile metadata: feat_lora: false
INFO 01-20 06:55:59.947 ngc_injector.py:147] Profile metadata: llm_engine: vllm
INFO 01-20 06:55:59.947 ngc_injector.py:147] Profile metadata: precision: fp16
INFO 01-20 06:55:59.947 ngc_injector.py:147] Profile metadata: tp: 1
INFO 01-20 06:55:59.947 ngc_injector.py:167] Preparing model workspace. This step might download additional files to run the model.
INFO 01-20 06:55:59.949 ngc_injector.py:173] Model workspace is now ready. It took 0.002 seconds
INFO 01-20 06:55:59.951 llm_engine.py:98] Initializing an LLM engine (v0.4.1) with config: model='/tmp/meta--llama3-8b-instruct-1qw4xz20', speculative_config=None, tokenizer='/tmp/meta--llama3-8b-instruct-1qw4xz20', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
WARNING 01-20 06:56:00.215 logging.py:314] Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 01-20 06:56:00.229 utils.py:609] Found nccl from library /usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2
INFO 01-20 06:56:01 selector.py:28] Using FlashAttention backend.
INFO 01-20 06:56:04 model_runner.py:173] Loading model weights took 14.9595 GB
INFO 01-20 06:56:06.85 gpu_executor.py:119] # GPU blocks: 35035, # CPU blocks: 2048
INFO 01-20 06:56:07 model_runner.py:973] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 01-20 06:56:07 model_runner.py:977] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
[nv-llama3-8b-instruct-65bcb494c5-rhfll:00031] *** Process received signal ***
3. Additional information
If I use the vLLM image directly, I can pass these parameters as container args, and that works:
containers:
  - name: qwen-72b
    image: vllm/vllm-openai:latest
    imagePullPolicy: IfNotPresent
    command: ["/bin/sh", "-c"]
    args: [
      "vllm serve /model-cache/modelscope/hub/Qwen/Qwen2___5-72B-Instruct --trust-remote-code --enable-chunked-prefill --max_num_batched_tokens 1024 --served-model-name qwen-72b --gpu_memory_utilization 0.7 --tensor_parallel_size 4 --enforce-eager"
    ]
    ports:
      - containerPort: 8000
    env:
      - name: PYTORCH_CUDA_ALLOC_CONF
        value: "expandable_segments:True"
      - name: VLLM_USE_MODELSCOPE
        value: "True"
    resources:
      limits:
        nvidia.com/gpu: "8"
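For completeness, the /model-cache path in the args has to be provided to the pod as a volume that already contains the downloaded weights; a minimal sketch of that part (the hostPath here is just one illustrative way to supply it, not necessarily how my cluster is set up):

    volumeMounts:
      - name: model-cache
        mountPath: /model-cache
# and, at the same level as "containers:" in the pod spec:
volumes:
  - name: model-cache
    hostPath:
      path: /model-cache   # node directory that already holds the downloaded Qwen weights (illustrative)
      type: Directory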



