
Passing env params into NIM does not seem to work (how to set params?) #297

@RandyChen1985

Description


1. Quick Debug Information

  • OS/Version (e.g. RHEL8.6, Ubuntu22.04): Ubuntu22.04
  • Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): Containerd
  • K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): K8s Server Version v1.28.15
  • NIM Operator Version: 1.0.1
root@bcm10-headnode:~/nim-operator-workspace# helm list -A
NAME                   	NAMESPACE       	REVISION	UPDATED                                	STATUS  	CHART                        	APP VERSION
gpu-operator-1734685148	gpu-operator    	1       	2024-12-20 16:59:09.420816993 +0800 CST	deployed	gpu-operator-v24.9.1         	v24.9.1
k8s-nim-operator       	nim-operator    	1       	2025-01-10 10:01:29.551913422 +0800 CST	deployed	k8s-nim-operator-1.0.1       	1.0.1
local-path-provisioner 	cm              	1       	2024-12-20 16:02:55.986116685 +0800 CST	deployed	local-path-provisioner-0.0.30	v0.0.30
network-operator       	network-operator	1       	2024-12-20 16:05:57.780143109 +0800 CST	deployed	network-operator-24.7.0      	v24.7.0

2. Issue or feature description

I tried setting NIM_MAX_MODEL_LEN to reduce GPU memory usage, but it does not seem to take effect.

Also, according to this page https://docs.nvidia.com/nim/large-language-models/latest/configuration.html,

I cannot find gpu_memory_utilization or enforce_eager, so I don't know how to set those two params.

And the NIM_MAX_MODEL_LEN value I pass does not work either.
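For reference, here is how I understand env vars are supposed to be passed: the NIMService CRD exposes a `spec.env` list that the operator injects into the NIM container. This is only a sketch against the 1.0.x API; the cache name, tag, and the NIM_MAX_MODEL_LEN value are illustrative assumptions, not copied from my actual manifest:

```yaml
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: nv-llama3-8b-instruct
  namespace: nim-service
spec:
  image:
    repository: nvcr.io/nim/meta/llama3-8b-instruct
    tag: 1.0.3
    pullPolicy: IfNotPresent
  storage:
    nimCache:
      # assumed NIMCache name; substitute the one from your cluster
      name: meta-llama3-8b-instruct
      profile: ''
  # env entries should be injected into the NIM container;
  # values must be strings, even for numeric settings
  env:
    - name: NIM_MAX_MODEL_LEN
      value: "4096"
  replicas: 1
  resources:
    limits:
      nvidia.com/gpu: 1
```

After applying, `kubectl exec -n nim-service <pod-name> -- env | grep NIM_` should show whether the variable actually reached the container, which separates an operator-side injection problem from the NIM image ignoring the setting.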

root@bcm10-headnode:~/nim-operator-workspace# kubectl create -f nv-llama3-8b-instruct-nim-service-nimcache.yaml
nimservice.apps.nvidia.com/nv-llama3-8b-instruct created
root@bcm10-headnode:~/nim-operator-workspace# kubectl get nimservices.apps.nvidia.com -n nim-service
NAME                    STATUS     AGE
nv-llama3-8b-instruct   NotReady   2s
root@bcm10-headnode:~/nim-operator-workspace# kubectl get nimservices.apps.nvidia.com -n nim-service
NAME                    STATUS     AGE
nv-llama3-8b-instruct   NotReady   12s
root@bcm10-headnode:~/nim-operator-workspace# kubectl get pod -n nim-service
NAME                                     READY   STATUS    RESTARTS   AGE
nv-llama3-8b-instruct-65bcb494c5-rhfll   0/1     Running   0          19s
root@bcm10-headnode:~/nim-operator-workspace# kubectl logs -f -n nim-service nv-llama3-8b-instruct-65bcb494c5-rhfll

===========================================
== NVIDIA Inference Microservice LLM NIM ==
===========================================

NVIDIA Inference Microservice LLM NIM Version 1.0.3
Model: meta/llama3-8b-instruct

Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This NIM container is governed by the NVIDIA AI Product Agreement here:
https://www.nvidia.com/en-us/data-center/products/nvidia-ai-enterprise/eula/.
A copy of this license can be found under /opt/nim/LICENSE.

The use of this model is governed by the AI Foundation Models Community License
here: https://docs.nvidia.com/ai-foundation-models-community-license.pdf.

ADDITIONAL INFORMATION: Meta Llama 3 Community License, Built with Meta Llama 3.
A copy of the Llama 3 license can be found under /opt/nim/MODEL_LICENSE.

2025-01-20 06:55:58,149 [INFO] PyTorch version 2.2.2 available.
2025-01-20 06:55:59,070 [WARNING] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: error
2025-01-20 06:55:59,070 [INFO] [TRT-LLM] [I] Starting TensorRT-LLM init.
[TensorRT-LLM][INFO] Set logger level by INFO
2025-01-20 06:55:59,300 [INFO] [TRT-LLM] [I] TensorRT-LLM inited.
[TensorRT-LLM] TensorRT-LLM version: 0.10.1.dev2024053000
INFO 01-20 06:55:59.944 api_server.py:489] NIM LLM API version 1.0.0
INFO 01-20 06:55:59.945 ngc_profile.py:218] Running NIM without LoRA. Only looking for compatible profiles that do not support LoRA.
INFO 01-20 06:55:59.945 ngc_profile.py:220] Detected 1 compatible profile(s).
INFO 01-20 06:55:59.945 ngc_injector.py:107] Valid profile: 8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d (vllm-fp16-tp1) on GPUs [0]
INFO 01-20 06:55:59.946 ngc_injector.py:142] Selected profile: 8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d (vllm-fp16-tp1)
INFO 01-20 06:55:59.946 ngc_injector.py:147] Profile metadata: feat_lora: false
INFO 01-20 06:55:59.947 ngc_injector.py:147] Profile metadata: llm_engine: vllm
INFO 01-20 06:55:59.947 ngc_injector.py:147] Profile metadata: precision: fp16
INFO 01-20 06:55:59.947 ngc_injector.py:147] Profile metadata: tp: 1
INFO 01-20 06:55:59.947 ngc_injector.py:167] Preparing model workspace. This step might download additional files to run the model.
INFO 01-20 06:55:59.949 ngc_injector.py:173] Model workspace is now ready. It took 0.002 seconds
INFO 01-20 06:55:59.951 llm_engine.py:98] Initializing an LLM engine (v0.4.1) with config: model='/tmp/meta--llama3-8b-instruct-1qw4xz20', speculative_config=None, tokenizer='/tmp/meta--llama3-8b-instruct-1qw4xz20', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
WARNING 01-20 06:56:00.215 logging.py:314] Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 01-20 06:56:00.229 utils.py:609] Found nccl from library /usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2
INFO 01-20 06:56:01 selector.py:28] Using FlashAttention backend.
INFO 01-20 06:56:04 model_runner.py:173] Loading model weights took 14.9595 GB
INFO 01-20 06:56:06.85 gpu_executor.py:119] # GPU blocks: 35035, # CPU blocks: 2048
INFO 01-20 06:56:07 model_runner.py:973] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 01-20 06:56:07 model_runner.py:977] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
[nv-llama3-8b-instruct-65bcb494c5-rhfll:00031] *** Process received signal ***

3. Additional information

If I use the vLLM image, I can pass params via args, and it works:

containers:
      - name: qwen-72b
        image: vllm/vllm-openai:latest
        imagePullPolicy: IfNotPresent
        command: ["/bin/sh", "-c"]
        args: [
          "vllm serve /model-cache/modelscope/hub/Qwen/Qwen2___5-72B-Instruct --trust-remote-code --enable-chunked-prefill --max_num_batc
hed_tokens 1024 --served-model-name qwen-72b  --gpu_memory_utilization 0.7 --tensor_parallel_size 4 --enforce-eager"
        ]
        ports:
        - containerPort: 8000
        env:
        - name: PYTORCH_CUDA_ALLOC_CONF
          value: "expandable_segments:True"
        - name: VLLM_USE_MODELSCOPE
          value: "True"
        resources:
          limits:
            nvidia.com/gpu: "8"