This guide describes how to deploy the AI-Q Research Assistant using Docker.
This blueprint depends on the NVIDIA RAG blueprint. The deployment guide includes instructions for deploying RAG using docker compose, but please consult the latest RAG documentation as well. The RAG blueprint requires NVIDIA NIM microservices that are either running on-premise or hosted by NVIDIA, including the NeMo Retriever microservices and an LLM (by default, Llama 3.3 Nemotron Super 49B). For a self-contained local deployment, 2xH100, 3xA100, 3xB200, or 2xRTX PRO 6000 GPUs are required.
In addition to the LLM used by RAG, Llama 3.3 Nemotron Super 49B (llama-3_3-nemotron-super-49b-v1_5), the AI-Q Research Assistant requires access to the Llama 3.3 Instruct 70B (llama-3.3-70b-instruct) model. Deploying this model requires an additional 2xH100, 2xB200, 4xA100, or 2xRTX PRO 6000 GPUs.
- Docker Compose
- NVIDIA Container Toolkit
- (Optional) This blueprint supports Tavily web search to supplement data from RAG. A Tavily API key can be supplied to enable this function.
For a self-contained local deployment:
- 5x B200 GPUs, or
- 4x H100 GPUs with 80GB of memory each, or
- 7x A100 GPUs with 80GB of memory each, or
- 4x RTX PRO 6000 GPUs with 96GB of memory each

For a deployment using hosted NVIDIA NIM microservices, no GPUs are required.
You will need access to the following NVIDIA NIM microservices:
- NemoRetriever Page Elements
- NemoRetriever Table Structure
- NemoRetriever Graphic Elements
- Paddle OCR
- Llama 3.3 Instruct 70B
- Llama 3.3 Nemotron Super 49B
This section demonstrates how to deploy the AI-Q Research Assistant.

Clone the aiq-research-assistant repository and set it as the working directory:
git clone https://github.com/NVIDIA-AI-Blueprints/aiq-research-assistant.git
cd aiq-research-assistant

Start by setting the required environment variables:
export NVIDIA_API_KEY=nvapi-your-nvidia-api-key
export NGC_API_KEY=$NVIDIA_API_KEY
export TAVILY_API_KEY=your-tavily-api-key
export USERID=$(id -u)
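Optionally, sanity-check the key before continuing. The hosted endpoint is OpenAI-compatible, so a 200 status from the model listing below confirms the key is valid (a quick sketch using curl; any HTTP client works):

# Expect 200 if NVIDIA_API_KEY is valid
curl -s -o /dev/null -w "%{http_code}\n" \
  -H "Authorization: Bearer $NVIDIA_API_KEY" \
  https://integrate.api.nvidia.com/v1/models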
echo "${NGC_API_KEY}" | docker login nvcr.io -u '$oauthtoken' --password-stdinCreate a model cache directory:
mkdir -p ~/.cache/model-cache
export MODEL_DIRECTORY=~/.cache/model-cache

Before deploying the AI-Q Research Assistant, deploy RAG by following the instructions below.
git clone https://github.com/NVIDIA-AI-Blueprints/rag.git -b main

Open the file rag/deploy/compose/.env and confirm that all of the values in the section # ==== Endpoints for using cloud NIMs === are commented out.
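A quick way to inspect that section without opening an editor is to print it with grep (the -A 10 window is arbitrary; widen it if the section is longer):

# Print the cloud-NIM endpoint section; every value line should start with #
grep -A 10 "Endpoints for using cloud NIMs" rag/deploy/compose/.env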
Then source the file:

source rag/deploy/compose/.env

Deploy the RAG NVIDIA NIM microservices, including the LLM. This step can take up to 45 minutes.
docker compose -f rag/deploy/compose/nims.yaml up -d

For A100/B200 systems, run the following commands instead:
export LLM_MS_GPU_ID=1,2
docker compose -f rag/deploy/compose/nims.yaml up -d

TIP: You can watch the status with watch -n 2 'docker ps --format "table {{.Names}}\t{{.Status}}"'.
To confirm that the deployment is successful, run docker ps --format "table {{.Names}}\t{{.Status}}". You should see:
NAMES STATUS
nemoretriever-ranking-ms Up 14 minutes (healthy)
compose-page-elements-1 Up 14 minutes
compose-paddle-1 Up 14 minutes
compose-graphic-elements-1 Up 14 minutes
compose-table-structure-1 Up 14 minutes
nemoretriever-embedding-ms Up 14 minutes (healthy)
nim-llm-ms Up 14 minutes (healthy)
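If you are scripting the deployment, you can block until the LLM NIM reports healthy instead of watching docker ps by hand. A minimal sketch, assuming the container name nim-llm-ms shown above:

# Poll the Docker health status until the LLM NIM is ready
until [ "$(docker inspect --format '{{.State.Health.Status}}' nim-llm-ms 2>/dev/null)" = "healthy" ]; do
  echo "waiting for nim-llm-ms..."
  sleep 10
done
echo "nim-llm-ms is healthy"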
Deploy the Vector DB:
export VECTORSTORE_GPU_DEVICE_ID=0
docker compose -f rag/deploy/compose/vectordb.yaml up -d

To confirm that the deployment was successful, run docker ps --format "table {{.Names}}\t{{.Status}}". In addition to the previously running containers, you should see:
milvus-standalone Up 2 minutes
milvus-minio Up 2 minutes (healthy)
milvus-etcd Up 2 minutes (healthy)
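Optionally, you can query Milvus directly. Port 9091 is the Milvus default for its metrics/health endpoint; adjust this if the blueprint's vectordb.yaml maps it differently:

# Should print OK when Milvus is serving
curl -s http://localhost:9091/healthz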
Deploy the ingestion server:
docker compose -f rag/deploy/compose/docker-compose-ingestor-server.yaml up -d

To confirm that the deployment was successful, run docker ps --format "table {{.Names}}\t{{.Status}}". In addition to the previously running containers, you should see:
compose-redis-1 Up 3 minutes
compose-nv-ingest-ms-runtime-1 Up 3 minutes (healthy)
ingestor-server Up 3 minutes
Deploy the RAG server:
docker compose -f rag/deploy/compose/docker-compose-rag-server.yaml up -d

To confirm that the deployment was successful, run docker ps --format "table {{.Names}}\t{{.Status}}". In addition to the previously running containers, you should see:
rag-frontend Up 4 minutes
rag-server Up 4 minutes
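Optionally, hit the HTTP health endpoints as well. The ports (8081 for rag-server, 8082 for ingestor-server) and the /v1/health paths follow the RAG blueprint defaults; adjust them if your deployment differs:

# RAG server health (assumed default port 8081)
curl -s http://localhost:8081/v1/health
# Ingestor server health (assumed default port 8082)
curl -s http://localhost:8082/v1/health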
Next, deploy the instruct model. This step can take up to 45 minutes.
By default, the deployment of the instruct LLM automatically selects the most suitable profile from the list of compatible profiles based on the detected hardware. If you encounter issues with the selected profile or prefer to use a different compatible profile, you can explicitly select the profile by setting the NIM_MODEL_PROFILE environment variable before deploying.
You can list available profiles by running the NIM container directly:
USERID=$(id -u) docker run --rm --gpus all \
nvcr.io/nim/meta/llama-3.3-70b-instruct:1.14.0 \
list-model-profiles

Using the list of model profiles from the previous step, set the NIM_MODEL_PROFILE environment variable. It is ideal to select one of the tensorrt_llm profiles for best performance. Here is an example of selecting one of these profiles for two H100 GPUs:
export NIM_MODEL_PROFILE="tensorrt_llm-h100-fp8-tp2-pp1-throughput-2330:10de-0013e870ea929584ec13dad6948450024cdc6c2f03a865f1b050fb08b9f64312-2"

Then update deploy/compose/docker-compose.yaml to add the NIM_MODEL_PROFILE environment variable to the aira-instruct-llm service environment section:
aira-instruct-llm:
  container_name: aira-instruct-llm
  image: nvcr.io/nim/meta/llama-3.3-70b-instruct:1.14.0
  # ... other configuration ...
  environment:
    NGC_API_KEY: ${NGC_API_KEY}
    NIM_MODEL_PROFILE: ${NIM_MODEL_PROFILE-""} # Add this line

The following tensorrt_llm profiles are optimized for different common GPU configurations:
tensorrt_llm-h100_nvl-fp8-tp2-pp1-throughput-2321:10de-3035d73242fb579040fb3f341adc36a7073f780419e73dd97edb7ce35cb0f550-2
tensorrt_llm-h100-fp8-tp2-pp1-throughput-2330:10de-0013e870ea929584ec13dad6948450024cdc6c2f03a865f1b050fb08b9f64312-2
tensorrt_llm-a100-bf16-tp4-pp1-throughput-20b2:10de-f14e1bad1a0e78da150aeedfee7919ab3ef21def09825caffef460b93fdde9b7-4
tensorrt_llm-rtx6000_blackwell_sv-fp8-tp2-pp1-throughput-2bb5:10de-77ab630b949b0a58ad580a22ea055bc392a30fbf57357d6398814e00775aab8c-2
tensorrt_llm-b200-bf16-tp2-pp1-throughput-2901:10de-6d1452af26f860b53df112c90f6b92f22a41156c09dafa2582c2c1194e56a673-2
More information about model profile selection can be found in the NVIDIA NIM for Large Language Models (LLMs) documentation.
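Before deploying, you can confirm that the variable actually reaches the service definition. docker compose config renders the file with environment substitution applied, so grepping its output shows the resolved value:

# Example with the H100 profile from the list above
export NIM_MODEL_PROFILE="tensorrt_llm-h100-fp8-tp2-pp1-throughput-2330:10de-0013e870ea929584ec13dad6948450024cdc6c2f03a865f1b050fb08b9f64312-2"
docker compose -f deploy/compose/docker-compose.yaml config | grep NIM_MODEL_PROFILE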
For an A100 system, run the following command:

export AIRA_LLM_MS_GPU_ID=3,4,5,6

For a B200 system, run the following command:

export AIRA_LLM_MS_GPU_ID=3,4

Run the following to deploy the model:
docker compose -f deploy/compose/docker-compose.yaml --profile aira-instruct-llm up -d

TIP: You can watch the status with watch -n 2 'docker ps --format "table {{.Names}}\t{{.Status}}"'.
To confirm that the deployment was successful, run docker ps --format "table {{.Names}}\t{{.Status}}". In addition to the previously running containers, you should see:
aira-instruct-llm Up 5 minutes (healthy)
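Once the container is healthy, you can optionally send a minimal request to the NIM's OpenAI-compatible API. The host port below is an assumption; check the ports mapping of the aira-instruct-llm service in deploy/compose/docker-compose.yaml and substitute the published port:

# Replace 8000 with the host port published for aira-instruct-llm
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta/llama-3.3-70b-instruct", "messages": [{"role": "user", "content": "Reply with one word: ready"}], "max_tokens": 8}'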
This step deploys the AIRA backend, AIRA proxy, and the pre-built AIRA demo frontend. The AIRA demo frontend is provided as a pre-built docker container containing a fully functional web application. The source code for this web application is not distributed.
docker compose -f deploy/compose/docker-compose.yaml --profile aira up -d

To confirm that the deployment was successful, run docker ps --format "table {{.Names}}\t{{.Status}}". In addition to the previously running containers, you should see:
aira-frontend Up 2 minutes
aira-backend Up 2 minutes
You can then view the web UI at:
localhost:3000
The backend will be running and visible at:
localhost:3838/docs
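A quick, optional reachability check for both endpoints from the host:

# Both should print 200 once the containers are up
curl -s -o /dev/null -w "frontend: %{http_code}\n" http://localhost:3000
curl -s -o /dev/null -w "backend: %{http_code}\n" http://localhost:3838/docs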
The AI-Q Research Assistant demo web application requires two default collections. One collection supports a biomedical research prompt and contains reports on Cystic Fibrosis. The second supports a financial research prompt and contains public financial documents from Alphabet, Meta, and Amazon. To pre-populate RAG with these two collections, run:
docker run \
-e RAG_INGEST_URL=http://ingestor-server:8082/v1 \
-e PYTHONUNBUFFERED=1 \
-v /tmp:/tmp-data \
--network nvidia-rag \
nvcr.io/nvidia/blueprint/aira-load-files:v1.2.0

This command will populate the default collections with sample documents. Note that this process can take up to 60 minutes to complete, during which time manual uploads from the frontend may not work properly.
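After the loader finishes, you can optionally confirm that the collections exist. The sketch below assumes the RAG server's default port (8081) and a /v1/collections listing endpoint as in the RAG blueprint; adjust if your deployment differs:

# List RAG collections; the two default collections should appear
curl -s http://localhost:8081/v1/collections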
Troubleshooting tips if the default collection creation fails:
- If you did not deploy RAG via docker compose, you will need to make two changes to the docker run command above: replace http://ingestor-server:8082/v1 with your RAG ingestor server address, and remove the line --network nvidia-rag.
- If you get an error that the zip file is not a valid zip file, install Git LFS for your platform (e.g., sudo apt-get install git-lfs) and then run:

git lfs install
git lfs pull

To stop all services, run the following commands in order:
- Stop the AI-Q Research Assistant services:

docker compose -f deploy/compose/docker-compose.yaml --profile aira down

- Stop the instruct model:

docker compose -f deploy/compose/docker-compose.yaml --profile aira-instruct-llm down

- Stop the RAG services:

docker compose -f rag/deploy/compose/docker-compose-rag-server.yaml down
docker compose -f rag/deploy/compose/docker-compose-ingestor-server.yaml down
docker compose -f rag/deploy/compose/vectordb.yaml down
docker compose -f rag/deploy/compose/nims.yaml down

- Remove the cache directories used by the RAG vector database and minio service:

rm -rf rag/deploy/compose/volumes/minio

Tip: If you retain these directories, the collections you created will remain the next time you start the services.
To verify all services have been stopped, run:

docker ps

If you already have RAG deployed, skip to the next step.
To deploy using hosted NVIDIA NIM microservices, follow the instructions for deploying the RAG blueprint using hosted models.
Edit the AI-Q configuration file located at configs/config.yml.
Update the following values, leaving the rest of the file with the default values:
- llms.instruct_llm.api_key: enter an NVIDIA API key with access to the required NVIDIA NIM microservices. Optional: update the model base_url and model_name if a different model is desired, such as an on-premise NVIDIA NIM microservice. The instruct_llm LLM is used for Q&A and report generation. An instruct model is recommended.
- llms.instruct_llm.base_url: update to https://integrate.api.nvidia.com/v1
- llms.nemotron.api_key: enter an NVIDIA API key with access to the required NVIDIA NIM microservices. Optional: update the model base_url and model_name if a different model is desired, such as an on-premise NVIDIA NIM microservice. The nemotron LLM is used for report planning and reflection. A reasoning model is recommended.
- llms.nemotron.base_url: update to https://integrate.api.nvidia.com/v1
- In the functions section, update functions.generate_summary.rag_url with the full public IP address and port for the rag-server from the RAG deployment. This step is only required if you have deployed RAG on a separate server. If you have deployed RAG using docker compose on the same server as AIRA, leave the default value.
- In the functions section, update functions.artifact_qa.rag_url with the full public IP address and port for the rag-server from the RAG deployment. This step is only required if you have deployed RAG on a separate server. If you have deployed RAG using docker compose on the same server as AIRA, leave the default value.
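If you pointed the rag_url values at a remote rag-server, a quick reachability check from the AIRA host can catch typos before deployment. The /v1/health path is assumed from the RAG blueprint defaults; substitute your actual host and port:

# Expect 200 from the remote rag-server
curl -s -o /dev/null -w "%{http_code}\n" http://UPDATE-TO-YOUR-RAG-IP-SERVER:8081/v1/health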
Edit the Docker Compose file located at deploy/compose/docker-compose.yaml.
- Update the value services.aira-backend.environment.TAVILY_API_KEY with your Tavily API key.
- If you have deployed RAG on a different server than AIRA, update the value services.aira-backend.environment.RAG_INGEST_URL with the public HTTP address of the RAG ingestor service, such as http://UPDATE-TO-YOUR-RAG-IP-SERVER:8082. If you have deployed RAG using docker compose on the same server as AIRA, leave the default value.
WARNING: The RAG ingest IP address must be resolvable outside the docker network, so addresses such as localhost or rag-server will not work. Currently only HTTP addresses are supported. HTTPS RAG deployments, or authenticated RAG deployments, will require updates to the NGINX proxy.
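A reasonable first check is to request the ingestor address from the host rather than from inside a container; any HTTP status code printed below proves the address is reachable, while a connection error means it is not:

# Substitute your actual ingestor address; a status code (even 404) means it resolves
curl -s -o /dev/null -w "%{http_code}\n" http://UPDATE-TO-YOUR-RAG-IP-SERVER:8082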
# Uncomment and run this command if you have deployed RAG on a different server
# docker network create nvidia-rag
docker compose -f deploy/compose/docker-compose.yaml --profile aira up -d

If you encounter any issues during deployment or operation, please refer to the comprehensive Troubleshooting Guide for detailed solutions and debugging steps.
For detailed instructions on setting up Phoenix dashboard for OpenTelemetry tracing, please refer to Phoenix Tracing Configuration.