Skip to content

Latest commit

 

History

History
207 lines (152 loc) · 7.96 KB

File metadata and controls

207 lines (152 loc) · 7.96 KB

Get Started with AI-Q NVIDIA Research Assistant Blueprint Helm Deployment

This guide provides instructions for deploying the NVIDIA AI-Q Research Assistant blueprint using Helm on a Kubernetes cluster.

Prerequisites

  1. A NGC API key that is able to access the AI-Q blueprint images. A key can be generated at https://org.ngc.nvidia.com/setup/api-keys. For Services Included, select NGC Catalog and Public API Endpoints.
  2. Kubernetes and Helm with NVIDIA GPU Operator installed. This helm chart was tested on Cloud Native Stack
  3. [Optional] A Tavily API key to support web search.

Hardware Requirements

The AI-Q Research Assistant blueprint requires the deployment of the NVIDIA RAG blueprint. To deploy both blueprints using Helm requires the following hardware configurations:

Option RAG Deployment AIRA Deployment Total Hardware Requirement
Single Node - MIG Sharing Use MIG sharing Default Deployment 4 x H100 80GB for RAG
2 x H100 80GB for AIRA
Multi Node Default Deployment Default Deployment 8 x H100 80GB for RAG
2 x H100 80GB for AIRA
---
9 x A100 80GB for RAG
4 x A100 80GB for AIRA
---
9 x B200 for RAG
2 x B200 for AIRA
---
8 x RTX PRO 6000 for RAG
2 x RTX PRO 6000 for AIRA

Note: Mixed MIG support requires GPU operator 25.3.2 or higher and NVIDIA Driver 570.172.08 or higher.

Deployment

Deploy RAG

Follow the NVIDIA RAG blueprint Helm deployment guide.

Deploy the AI-Q Research Assistant

Set environment variables

export NGC_API_KEY="<your-ngc-api-key>"
export TAVILY_API_KEY="<your-tavily-api-key>"

Clone the repo

git clone https://github.com/NVIDIA-AI-Blueprints/aiq-research-assistant

Navigate to the helm chart directory

cd aiq-research-assistant/deploy/helm

Create a namespace for AIQ helm chart

kubectl create namespace aiq

Deploy the chart

To deploy pre-built chart from NGC:

helm install aiq-aira https://helm.ngc.nvidia.com/nvidia/blueprint/charts/aiq-aira-v1.2.1.tgz \
--username='$oauthtoken'  \
--password=$NGC_API_KEY \
--set imagePullSecret.password=$NGC_API_KEY \
--set ngcApiSecret.password=$NGC_API_KEY \
--set tavilyApiSecret.password=$TAVILY_API_KEY -n aiq

To deploy from source:

helm install aiq-aira aiq-aira/ \
--set imagePullSecret.password=$NGC_API_KEY \
--set ngcApiSecret.password=$NGC_API_KEY \
--set tavilyApiSecret.password=$TAVILY_API_KEY -n aiq

The deployment commands above assume the RAG deployment instructions were followed in Deploy RAG with the RAG services running in the rag namespace. If using a different RAG deployment, the default service URLs can be overridden by setting backendEnvVars.RAG_SERVER_URL and backendEnvVars.RAG_INGEST_URL. For example:

helm install aiq-aira aiq-aira/ \
--set imagePullSecret.password=$NGC_API_KEY \
--set ngcApiSecret.password=$NGC_API_KEY \
--set tavilyApiSecret.password=$TAVILY_API_KEY \
--set backendEnvVars.RAG_SERVER_URL=<RAG_SERVER_URL> \
--set backendEnvVars.RAG_INGEST_URL=<INGESTOR_SERVER_URL> -n aiq

Instruct LLM profile selection

By default, the deployment of the instruct LLM automatically selects the most suitable profile from the list of compatible profiles based on the detected hardware. If you encounter issues with the selected profile or prefer to use a different compatible profile, you can explicitly select the profile by adding the NIM_MODEL_PROFILE environment variable to the nim-llm section in values.yaml.

You can list available profiles by running the NIM container directly:

USERID=$(id -u) docker run --rm --gpus all \
  nvcr.io/nim/meta/llama-3.3-70b-instruct:1.14.0 \
  list-model-profiles

Using the list of model profiles from the previous step, add the NIM_MODEL_PROFILE in the nim-llm section of the values.yaml. It is ideal to select one of the tensorrt_llm profiles for best performance. Here is an example of selecting one of these profiles for two H100 GPUs:

nim-llm:
  enabled: true
  resources:
    limits:
      nvidia.com/gpu: 2
    requests:
      nvidia.com/gpu: 2
  env:  # Add this section
    - name: NIM_MODEL_PROFILE
      value: "tensorrt_llm-h100-fp8-tp2-pp1-throughput-2330:10de-0013e870ea929584ec13dad6948450024cdc6c2f03a865f1b050fb08b9f64312-2"
  model:
    ngcAPIKey: ""
    name: "meta/llama-3.3-70b-instruct"

If using A100s, the nim-llm section will also have to be updated to allocate four GPUs instead of two:

resources:
  limits:
    nvidia.com/gpu: 4
  requests:
    nvidia.com/gpu: 4
Hardware-Specific Profiles

The following tensorrt_llm profiles are optimized for different common GPU configurations:

2xH100 NVL
tensorrt_llm-h100_nvl-fp8-tp2-pp1-throughput-2321:10de-3035d73242fb579040fb3f341adc36a7073f780419e73dd97edb7ce35cb0f550-2
2xH100 SXM
tensorrt_llm-h100-fp8-tp2-pp1-throughput-2330:10de-0013e870ea929584ec13dad6948450024cdc6c2f03a865f1b050fb08b9f64312-2
4xA100
tensorrt_llm-a100-bf16-tp4-pp1-throughput-20b2:10de-f14e1bad1a0e78da150aeedfee7919ab3ef21def09825caffef460b93fdde9b7-4
2xRTX PRO 6000
tensorrt_llm-rtx6000_blackwell_sv-fp8-tp2-pp1-throughput-2bb5:10de-77ab630b949b0a58ad580a22ea055bc392a30fbf57357d6398814e00775aab8c-2
2xB200
tensorrt_llm-b200-bf16-tp2-pp1-throughput-2901:10de-6d1452af26f860b53df112c90f6b92f22a41156c09dafa2582c2c1194e56a673-2

More information about model profile selection can be found here in the NVIDIA NIM for Large Language Models (LLMs) documentation.

Check status of pods

kubectl get pods -n aiq

Response should look like this:

NAME                                      READY   STATUS             RESTARTS       AGE
aiq-aira-aira-backend-5797589756-td5b2    1/1     Running            0              5m
aiq-aira-aira-frontend-74ff7cc5c8-wf9jx   1/1     Running            0              5m
aiq-aira-nim-llm-0                        1/1     Running            0              5m
aiq-aira-phoenix-78fd7584b7-s9bwc         1/1     Running            0              5m

Access to UI

Since the frontend service has a nodePort configured for port 30080, you can view the UI from a web browser on the host running kubectl at http://localhost:30080.

The UI can also be viewed from outside the cluster at: http://<cluster-node-name-or-ip>:30080

Create Default Collections

The AI-Q NVIDIA Research Assistant demo web application requires two default collections. One collection supports a biomedical research prompt and contains reports on Cystic Fibrosis. The second supports a financial research prompt and contains public financial documents from Alphabet, Meta, and Amazon.

Follow the steps in Bulk Upload via Python to create these default collections.

Stopping Services

To stop all services, run the following commands:

  1. Delete the AIRA deployment:
helm delete aiq-aira -n aiq
  1. Delete the RAG deployment:
helm delete rag -n rag
  1. Delete the namespaces:
kubectl delete namespace aiq
kubectl delete namespace rag