This guide provides instructions for deploying the NVIDIA AI-Q Research Assistant blueprint using Helm on a Kubernetes cluster.
Before you begin, make sure you have the following:
- An NGC API key with access to the AI-Q blueprint images. A key can be generated at https://org.ngc.nvidia.com/setup/api-keys. For Services Included, select NGC Catalog and Public API Endpoints.
- A Kubernetes cluster with Helm and the NVIDIA GPU Operator installed. This Helm chart was tested on NVIDIA Cloud Native Stack.
- [Optional] A Tavily API key to support web search.
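Optionally, you can confirm that the NGC API key is valid before continuing. A minimal check, assuming Docker is available on your workstation, is to log in to the NGC container registry with the key:

```bash
# The username is always the literal string $oauthtoken; the password is the NGC API key
echo "<your-ngc-api-key>" | docker login nvcr.io --username '$oauthtoken' --password-stdin
```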
The AI-Q Research Assistant blueprint requires the deployment of the NVIDIA RAG blueprint. Deploying both blueprints with Helm requires one of the following hardware configurations:
| Option | RAG Deployment | AIRA Deployment | Total Hardware Requirement |
|---|---|---|---|
| Single Node - MIG Sharing | Use MIG sharing | Default Deployment | 4 x H100 80GB for RAG, 2 x H100 80GB for AIRA |
| Multi Node | Default Deployment | Default Deployment | 8 x H100 80GB for RAG, 2 x H100 80GB for AIRA<br>or 9 x A100 80GB for RAG, 4 x A100 80GB for AIRA<br>or 9 x B200 for RAG, 2 x B200 for AIRA<br>or 8 x RTX PRO 6000 for RAG, 2 x RTX PRO 6000 for AIRA |
Note: Mixed MIG support requires GPU operator 25.3.2 or higher and NVIDIA Driver 570.172.08 or higher.
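One way to confirm that the cluster advertises enough GPUs and meets the version requirements is sketched below; it assumes the GPU Operator was installed with Helm into the `gpu-operator` namespace and that you can run `nvidia-smi` on a GPU node:

```bash
# Confirm each node advertises the expected number of GPUs to Kubernetes
kubectl describe nodes | grep -i 'nvidia.com/gpu'

# Check the NVIDIA driver version on a GPU node
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# Check the GPU Operator release version (assumes a Helm install in the gpu-operator namespace)
helm list -n gpu-operator
```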
Follow the NVIDIA RAG blueprint Helm deployment guide.
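After the RAG blueprint is deployed, it is worth confirming that its services are up before installing AIRA. A minimal check, assuming RAG was installed into the `rag` namespace as in the RAG deployment guide:

```bash
# All RAG pods should eventually report Running or Completed
kubectl get pods -n rag

# The RAG server and ingestor services should be listed here
kubectl get svc -n rag
```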
Next, set your NGC and Tavily API keys as environment variables, clone the AIRA repository, and create the target namespace:

```bash
export NGC_API_KEY="<your-ngc-api-key>"
export TAVILY_API_KEY="<your-tavily-api-key>"

git clone https://github.com/NVIDIA-AI-Blueprints/aiq-research-assistant
cd aiq-research-assistant/deploy/helm

kubectl create namespace aiq
```

To deploy the pre-built chart from NGC:

```bash
helm install aiq-aira https://helm.ngc.nvidia.com/nvidia/blueprint/charts/aiq-aira-v1.2.1.tgz \
  --username='$oauthtoken' \
  --password=$NGC_API_KEY \
  --set imagePullSecret.password=$NGC_API_KEY \
  --set ngcApiSecret.password=$NGC_API_KEY \
  --set tavilyApiSecret.password=$TAVILY_API_KEY -n aiq
```

To deploy from source:
```bash
helm install aiq-aira aiq-aira/ \
  --set imagePullSecret.password=$NGC_API_KEY \
  --set ngcApiSecret.password=$NGC_API_KEY \
  --set tavilyApiSecret.password=$TAVILY_API_KEY -n aiq
```

The deployment commands above assume you followed the RAG deployment instructions in Deploy RAG and that the RAG services are running in the `rag` namespace. If you are using a different RAG deployment, you can override the default service URLs by setting `backendEnvVars.RAG_SERVER_URL` and `backendEnvVars.RAG_INGEST_URL`. For example:
```bash
helm install aiq-aira aiq-aira/ \
  --set imagePullSecret.password=$NGC_API_KEY \
  --set ngcApiSecret.password=$NGC_API_KEY \
  --set tavilyApiSecret.password=$TAVILY_API_KEY \
  --set backendEnvVars.RAG_SERVER_URL=<RAG_SERVER_URL> \
  --set backendEnvVars.RAG_INGEST_URL=<INGESTOR_SERVER_URL> -n aiq
```
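If you need to determine those URLs, the service names and ports can be read from the RAG namespace and assembled into cluster-internal URLs of the form `http://<service>.<namespace>.svc.cluster.local:<port>`. The URLs below are illustrative placeholders rather than guaranteed defaults; substitute whatever `kubectl get svc` reports:

```bash
# Inspect the RAG services and their ports
kubectl get svc -n rag

# Placeholder examples of the resulting override flags:
#   --set backendEnvVars.RAG_SERVER_URL=http://rag-server.rag.svc.cluster.local:8081
#   --set backendEnvVars.RAG_INGEST_URL=http://ingestor-server.rag.svc.cluster.local:8082
```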
By default, the deployment of the instruct LLM automatically selects the most suitable profile from the list of compatible profiles based on the detected hardware. If you encounter issues with the selected profile, or prefer to use a different compatible profile, you can select a profile explicitly by adding the `NIM_MODEL_PROFILE` environment variable to the `nim-llm` section in `values.yaml`.

You can list the available profiles by running the NIM container directly:
```bash
USERID=$(id -u) docker run --rm --gpus all \
  nvcr.io/nim/meta/llama-3.3-70b-instruct:1.14.0 \
  list-model-profiles
```

Using the list of model profiles from the previous step, add `NIM_MODEL_PROFILE` to the `nim-llm` section of `values.yaml`. For best performance, select one of the `tensorrt_llm` profiles. Here is an example of selecting one of these profiles for two H100 GPUs:
```yaml
nim-llm:
  enabled: true
  resources:
    limits:
      nvidia.com/gpu: 2
    requests:
      nvidia.com/gpu: 2
  env: # Add this section
    - name: NIM_MODEL_PROFILE
      value: "tensorrt_llm-h100-fp8-tp2-pp1-throughput-2330:10de-0013e870ea929584ec13dad6948450024cdc6c2f03a865f1b050fb08b9f64312-2"
  model:
    ngcAPIKey: ""
    name: "meta/llama-3.3-70b-instruct"
```
If you are using A100s, the `nim-llm` section must also be updated to allocate four GPUs instead of two:
```yaml
resources:
  limits:
    nvidia.com/gpu: 4
  requests:
    nvidia.com/gpu: 4
```
The following tensorrt_llm profiles are optimized for different common GPU configurations:
- H100 NVL: `tensorrt_llm-h100_nvl-fp8-tp2-pp1-throughput-2321:10de-3035d73242fb579040fb3f341adc36a7073f780419e73dd97edb7ce35cb0f550-2`
- H100: `tensorrt_llm-h100-fp8-tp2-pp1-throughput-2330:10de-0013e870ea929584ec13dad6948450024cdc6c2f03a865f1b050fb08b9f64312-2`
- A100: `tensorrt_llm-a100-bf16-tp4-pp1-throughput-20b2:10de-f14e1bad1a0e78da150aeedfee7919ab3ef21def09825caffef460b93fdde9b7-4`
- RTX PRO 6000 Blackwell: `tensorrt_llm-rtx6000_blackwell_sv-fp8-tp2-pp1-throughput-2bb5:10de-77ab630b949b0a58ad580a22ea055bc392a30fbf57357d6398814e00775aab8c-2`
- B200: `tensorrt_llm-b200-bf16-tp2-pp1-throughput-2901:10de-6d1452af26f860b53df112c90f6b92f22a41156c09dafa2582c2c1194e56a673-2`
More information about model profile selection can be found in the NVIDIA NIM for Large Language Models (LLMs) documentation.
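After editing `values.yaml`, re-run the Helm installation so the new profile takes effect. A minimal sketch for the deploy-from-source path, where the edited `values.yaml` inside the chart directory is picked up automatically:

```bash
# Re-apply the chart with the updated values.yaml; secrets are passed the same way as before
helm upgrade --install aiq-aira aiq-aira/ \
  --set imagePullSecret.password=$NGC_API_KEY \
  --set ngcApiSecret.password=$NGC_API_KEY \
  --set tavilyApiSecret.password=$TAVILY_API_KEY -n aiq
```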
To verify the deployment, check the status of the pods:

```bash
kubectl get pods -n aiq
```

The response should look like this:

```
NAME                                      READY   STATUS    RESTARTS   AGE
aiq-aira-aira-backend-5797589756-td5b2    1/1     Running   0          5m
aiq-aira-aira-frontend-74ff7cc5c8-wf9jx   1/1     Running   0          5m
aiq-aira-nim-llm-0                        1/1     Running   0          5m
aiq-aira-phoenix-78fd7584b7-s9bwc         1/1     Running   0          5m
```
Since the frontend service has a nodePort configured for port 30080, you can view the UI from a web browser on the host running kubectl at http://localhost:30080.
The UI can also be viewed from outside the cluster at: http://<cluster-node-name-or-ip>:30080
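If the NodePort is not reachable from your machine (for example, because of firewall rules), port-forwarding through `kubectl` is an alternative. The service name and target port below are placeholders inferred from the pod names above; confirm the actual values with `kubectl get svc -n aiq` first:

```bash
# List the services to find the frontend service name and port
kubectl get svc -n aiq

# Forward local port 30080 to the frontend service (placeholder service name and target port)
kubectl port-forward -n aiq svc/aiq-aira-aira-frontend 30080:3000
```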
The AI-Q NVIDIA Research Assistant demo web application requires two default collections. One collection supports a biomedical research prompt and contains reports on Cystic Fibrosis. The second supports a financial research prompt and contains public financial documents from Alphabet, Meta, and Amazon.
Follow the steps in Bulk Upload via Python to create these default collections.
To stop all services, run the following commands:
- Delete the AIRA deployment:
  ```bash
  helm delete aiq-aira -n aiq
  ```

- Delete the RAG deployment:

  ```bash
  helm delete rag -n rag
  ```

- Delete the namespaces:

  ```bash
  kubectl delete namespace aiq
  kubectl delete namespace rag
  ```
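To confirm the teardown completed, you can check that no AIRA or RAG releases or namespaces remain:

```bash
helm list --all-namespaces
kubectl get namespaces
```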