
Add documentation for observability and port forwarding #129

Merged
AjayThorve merged 5 commits into NVIDIA-AI-Blueprints:develop from pastorsj:feature/document_updates_telemetry_port_forwarding
Mar 10, 2026

@pastorsj
Contributor

What does this PR do?

  • Adds documentation for adding observability platforms for AI-Q, including LangSmith, Weights and Biases, and Arize Phoenix
  • Adds documentation for port forwarding capabilities

Signed-off-by: Sam Pastoriza <spastoriza@nvidia.com>
@greptile-apps
Contributor

greptile-apps bot commented Mar 10, 2026

Greptile Summary

This PR adds two new documentation sections to the AI-Q blueprint: a comprehensive Observability guide covering Phoenix, LangSmith, Weights & Biases Weave, the OTEL Collector with privacy redaction, and verbose logging; and a VM / Remote Development troubleshooting section explaining SSH port forwarding. Supporting changes include updating deploy/.env.example to consolidate observability-related environment variables into a dedicated block, trimming redundant tracing content from production.md, and wiring up the new pages in the Sphinx TOC and deployment index.

  • New docs/source/deployment/observability.md provides per-backend YAML config snippets, configuration reference tables, and "What You Can Inspect" summaries for all five backends.
  • New "VM / Remote Development" section in troubleshooting.md covers SSH port forwarding, VS Code Remote-SSH, and a common-symptoms table.
  • deploy/.env.example moves WANDB_API_KEY and adds LangSmith vars into a new Observability / Tracing section with a direct pointer to the new doc.
  • All previously flagged issues (package manager inconsistency, batch YAML context, missing OTEL reference-table fields, misleading LangSmith YAML block, ~C escape formatting) have been addressed in this revision.
  • One minor gap remains: the OTEL Batch Configuration section shows five fields in a code block without a reference table, leaving the units of time-based values (flush_interval, shutdown_timeout) ambiguous.
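The SSH port forwarding covered by the new troubleshooting section can be sketched as follows. This is a minimal illustration, not the text of the PR's documentation: `build_ssh_forward_command` is a hypothetical helper, the host name, user, and port 3000 are placeholders, and only port 6006 (Phoenix) comes from this PR. It relies on the standard OpenSSH `-L local_port:host:remote_port` and `-N` flags.

```python
import shlex


def build_ssh_forward_command(host, ports, user=None):
    """Build an ssh argv that forwards each local port to the same port on the remote machine."""
    if not ports:
        raise ValueError("at least one port is required")
    target = f"{user}@{host}" if user else host
    argv = ["ssh", "-N"]  # -N: do not run a remote command, only forward ports
    for port in ports:
        if not 1 <= port <= 65535:
            raise ValueError(f"invalid port: {port}")
        # -L local_port:destination_host:destination_port
        argv += ["-L", f"{port}:localhost:{port}"]
    argv.append(target)
    return argv


# 6006 is the Phoenix port named in this PR; 3000, the host, and the user are placeholders.
command = build_ssh_forward_command("my-vm.example.com", [6006, 3000], user="ubuntu")
print(shlex.join(command))
```

Several `-L` flags can share one `ssh` invocation, which is why the helper appends them all to a single command rather than opening one connection per port.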

Confidence Score: 5/5

  • This is a documentation-only PR with no code changes; safe to merge.
  • All changes are Markdown documentation and a commented .env.example update. No logic, configuration defaults, or runtime behavior is modified. Previously flagged review issues have been addressed. The single remaining gap (missing units/descriptions for OTEL batch fields) is a minor style concern that does not block merging.
  • No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| docs/source/deployment/observability.md | New 250-line observability guide covering Phoenix, LangSmith, Weave, OTEL Collector, and Verbose Logging. Previously flagged issues (uv vs pip, batch YAML context, missing OTEL table fields) are now addressed. One minor gap: batch configuration fields lack descriptions/units in a reference table. |
| docs/source/resources/troubleshooting.md | New "VM / Remote Development" section added with SSH port forwarding instructions. The ~C escape sequence usage now correctly shows both -L flags on a single line, and a cross-reference to the new observability doc is included. |
| deploy/.env.example | Observability / Tracing block added with LangSmith and W&B env vars. WANDB_API_KEY relocated from the Evaluation section to this new block, which is more appropriate. |
| docs/source/get-started/quick-start.md | Adds a tip block pointing remote-VM users to SSH port forwarding guidance with a direct link to the new troubleshooting section. |
| docs/source/deployment/production.md | Tracing subsection condensed to a single-sentence pointer to the new observability.md, removing duplication. Clean change. |
| docs/source/deployment/index.md | Adds the new Observability page to the deployment section index. No issues. |
| docs/source/index.md | Adds Observability entry to the Sphinx toctree between Docker Build System and Production. No issues. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[AI-Q Application] --> B{Observability Backend?}

    B -->|Phoenix| C[arize-phoenix server\nlocalhost:6006]
    B -->|LangSmith| D[LangSmith Cloud\nvia LANGCHAIN_* env vars]
    B -->|Weave| E[W&B Weave\nvia WANDB_API_KEY + YAML config]
    B -->|OTEL Collector| F[otelcollector_redaction exporter]
    B -->|Verbose| G[Console / stdout]

    F --> H{Redaction enabled?}
    H -->|Yes| I[Redact PII / sensitive attrs]
    H -->|No| J[Forward spans as-is]
    I --> K[OTEL Collector\nJaeger / Tempo / Datadog]
    J --> K

    C -->|Trace UI| L[localhost:6006 UI]
    D -->|Cloud dashboard| M[smith.langchain.com]
    E -->|Cloud dashboard| N[wandb.ai]

Last reviewed commit: 7ed9d79

AjayThorve previously approved these changes Mar 10, 2026
Collaborator

@AjayThorve AjayThorve left a comment

one typo, otherwise looks great!

Signed-off-by: Sam Pastoriza <spastoriza@nvidia.com>
Signed-off-by: Sam Pastoriza <spastoriza@nvidia.com>
AjayThorve previously approved these changes Mar 10, 2026
Collaborator

@AjayThorve AjayThorve left a comment

LGTM

Signed-off-by: Sam Pastoriza <spastoriza@nvidia.com>
@AjayThorve AjayThorve merged commit ab629f8 into NVIDIA-AI-Blueprints:develop Mar 10, 2026
4 checks passed
Comment on lines +209 to +225
### Batch Configuration

The exporter supports standard OTEL batch settings:

```yaml
general:
  telemetry:
    tracing:
      otel:
        _type: otelcollector_redaction
        endpoint: http://your-otel-collector:4318/v1/traces
        batch_size: 512
        flush_interval: 5000
        max_queue_size: 2048
        drop_on_overflow: false
        shutdown_timeout: 30000
```
Contributor

Batch configuration fields undocumented — units unclear for time values

The "Batch Configuration" section shows five fields in a code block but none of them appear in a reference table. In particular, flush_interval: 5000 and shutdown_timeout: 30000 are ambiguous — a user unfamiliar with OTEL exporters cannot tell whether these are in milliseconds, seconds, or some other unit from the snippet alone.

Consider either extending the existing configuration reference table or adding a small table here:

| Field | Description |
| --- | --- |
| batch_size | Maximum number of spans per export batch. |
| flush_interval | Interval in milliseconds between automatic flushes (default: 5000 = 5 s). |
| max_queue_size | Maximum number of spans held in the queue before exporting. |
| drop_on_overflow | Whether to drop spans when the queue reaches max_queue_size. |
| shutdown_timeout | Maximum time in milliseconds to wait for in-flight spans on shutdown (default: 30000 = 30 s). |
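To make the unit question concrete, here is a small sketch of the arithmetic behind the suggested descriptions. It assumes, as the suggestion above does, that both time values are in milliseconds; the exporter's real units are not confirmed by the snippet. `BatchSettings` and `summarize` are hypothetical names used only for illustration, and the check that `batch_size` does not exceed `max_queue_size` is a sanity check added here, not documented exporter behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BatchSettings:
    """Batch fields from the YAML snippet; the _ms suffix encodes the assumed unit."""
    batch_size: int = 512
    flush_interval_ms: int = 5000
    max_queue_size: int = 2048
    drop_on_overflow: bool = False
    shutdown_timeout_ms: int = 30000


def summarize(settings: BatchSettings) -> dict:
    """Convert the assumed-millisecond values to seconds and relate batch size to queue size."""
    if settings.batch_size <= 0 or settings.max_queue_size <= 0:
        raise ValueError("batch_size and max_queue_size must be positive")
    if settings.batch_size > settings.max_queue_size:
        # Illustrative sanity check: a batch larger than the queue could never fill.
        raise ValueError("batch_size should not exceed max_queue_size")
    return {
        "flush_interval_s": settings.flush_interval_ms / 1000,
        "shutdown_timeout_s": settings.shutdown_timeout_ms / 1000,
        "full_batches_in_queue": settings.max_queue_size // settings.batch_size,
    }


summary = summarize(BatchSettings())
print(summary)
```

Under the millisecond assumption, the values in the snippet correspond to a 5-second flush interval and a 30-second shutdown timeout, and the queue holds four full batches of 512 spans.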
