
No Prometheus/OpenTelemetry metrics — operators cannot monitor proxy behavior over time #72

@avelino

Description


Problem

The proxy exposes a GET /health endpoint that returns basic status information:

{
  "status": "ok",
  "backends_configured": 3,
  "backends_connected": 2,
  "active_clients": 1,
  "tools": 42,
  "version": "0.4.3"
}

This is sufficient for liveness checks, but provides no observability into the proxy's runtime behavior over time. In a production Kubernetes deployment, operators need to answer questions like:

Questions that cannot be answered today

  1. Throughput — How many tool calls per second is the proxy handling? What's the breakdown by backend? By tool? By identity?
  2. Latency — What's the p50/p95/p99 latency for tool calls? Which backends are slow? Are latencies increasing over time?
  3. Error rates — What percentage of tool calls are failing? Is the error rate spiking? Which backends have the highest error rates?
  4. Backend pool health — How often are backends being connected/disconnected by the idle reaper? How long do backend connections live? How often does lazy reconnection happen?
  5. Connection management — How many SSE/Streamable HTTP sessions are active? What's the connection churn rate? How often do TCP keepalive timeouts fire?
  6. Resource usage — How large is the in-memory tool cache? How many child processes are running (for stdio backends)?
  7. Auth/ACL — How many requests are rejected by ACL? What's the breakdown by identity and denied tool?
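As a rough sketch of the surface that would answer these questions, here is a minimal, dependency-free Rust example that models a few of the implied metric families and renders them in the Prometheus text exposition format. All metric names are hypothetical, and a real implementation would more likely use the `prometheus` or `opentelemetry` crates than hand-roll the format:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical metric names -- illustrative only, not taken from the codebase.
#[derive(Default)]
struct ProxyMetrics {
    tool_calls_total: AtomicU64,       // throughput (question 1)
    tool_call_errors_total: AtomicU64, // error rates (question 3)
    active_clients: AtomicU64,         // connection management (question 5)
    tool_cache_entries: AtomicU64,     // resource usage (question 6)
    acl_denied_total: AtomicU64,       // auth/ACL (question 7)
}

impl ProxyMetrics {
    /// Render the metrics in the Prometheus text exposition format,
    /// as a GET /metrics handler would return them.
    fn render(&self) -> String {
        let families = [
            ("mcp_proxy_tool_calls_total", "counter", &self.tool_calls_total),
            ("mcp_proxy_tool_call_errors_total", "counter", &self.tool_call_errors_total),
            ("mcp_proxy_acl_denied_total", "counter", &self.acl_denied_total),
            ("mcp_proxy_active_clients", "gauge", &self.active_clients),
            ("mcp_proxy_tool_cache_entries", "gauge", &self.tool_cache_entries),
        ];
        let mut out = String::new();
        for (name, kind, value) in families {
            let v = value.load(Ordering::Relaxed);
            out.push_str(&format!("# TYPE {name} {kind}\n{name} {v}\n"));
        }
        out
    }
}

fn main() {
    let m = ProxyMetrics::default();
    m.tool_calls_total.fetch_add(3, Ordering::Relaxed);
    m.tool_call_errors_total.fetch_add(1, Ordering::Relaxed);
    m.active_clients.store(1, Ordering::Relaxed);
    let body = m.render();
    assert!(body.contains("mcp_proxy_tool_calls_total 3"));
    println!("{body}");
}
```

Atomics keep the hot path lock-free; a scrape just reads the current values, so the handler adds no contention to tool-call handling.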

Why this matters

The proxy is designed to be shared infrastructure — a single proxy serving multiple AI clients. The /health endpoint is a point-in-time snapshot with no history, no aggregation, and no percentiles. Operators relying on it can only tell "is it up right now?" but not "is it degrading?" or "should I scale?"

The standard solution in the Kubernetes ecosystem is a Prometheus-compatible /metrics endpoint exposing counters, gauges, and histograms that are scraped by Prometheus/Grafana, Datadog, or similar monitoring stacks.
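For illustration, a scrape of such an endpoint might return something like the following (metric and label names are hypothetical, not from the codebase):

```
# HELP mcp_proxy_tool_calls_total Tool calls proxied, by backend and outcome.
# TYPE mcp_proxy_tool_calls_total counter
mcp_proxy_tool_calls_total{backend="github",outcome="ok"} 1042
mcp_proxy_tool_calls_total{backend="github",outcome="error"} 17
# HELP mcp_proxy_tool_call_duration_seconds Tool call latency.
# TYPE mcp_proxy_tool_call_duration_seconds histogram
mcp_proxy_tool_call_duration_seconds_bucket{backend="github",le="0.1"} 903
mcp_proxy_tool_call_duration_seconds_bucket{backend="github",le="1"} 1050
mcp_proxy_tool_call_duration_seconds_bucket{backend="github",le="+Inf"} 1059
mcp_proxy_tool_call_duration_seconds_sum{backend="github"} 212.4
mcp_proxy_tool_call_duration_seconds_count{backend="github"} 1059
# TYPE mcp_proxy_backends_connected gauge
mcp_proxy_backends_connected 2
```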

Data sources already in the code

The proxy already tracks much of this data internally but doesn't expose it:

  • Audit log — records every tool call with duration, success, identity, server, tool name (src/audit.rs)
  • Backend pool — tracks connected/configured/idle backends with timestamps (src/serve.rs)
  • Active clients — maintained as a counter in AppState (src/serve.rs)
  • Tool cache — holds all discovered tools with their server associations (src/serve.rs)
  • ACL decisions — grant/deny/classify happen on every request (src/server_auth/)
  • Idle reaper — runs every 30s and logs disconnections (src/serve.rs)

The data exists — it's just not exposed in a scrapeable format.
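Since the audit log already records per-call duration and outcome, feeding it into a latency histogram could be a small change at the point where the record is written. A sketch, assuming a simplified `AuditRecord` shape (the real struct in src/audit.rs may differ):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Assumed shape of an audit record; the real struct in src/audit.rs may differ.
struct AuditRecord {
    duration_ms: u64,
    success: bool,
}

/// A fixed-bucket latency histogram, Prometheus-style: render-time counts are
/// cumulative per upper bound (`le` semantics), with a final +Inf bucket.
struct LatencyHistogram {
    bounds_ms: &'static [u64],
    buckets: Vec<AtomicU64>, // one per bound, plus +Inf
    sum_ms: AtomicU64,
}

impl LatencyHistogram {
    fn new(bounds_ms: &'static [u64]) -> Self {
        let buckets = (0..=bounds_ms.len()).map(|_| AtomicU64::new(0)).collect();
        Self { bounds_ms, buckets, sum_ms: AtomicU64::new(0) }
    }

    /// Called wherever the audit record is written today.
    fn observe(&self, rec: &AuditRecord) {
        let idx = self
            .bounds_ms
            .iter()
            .position(|&b| rec.duration_ms <= b)
            .unwrap_or(self.bounds_ms.len()); // falls into the +Inf bucket
        self.buckets[idx].fetch_add(1, Ordering::Relaxed);
        self.sum_ms.fetch_add(rec.duration_ms, Ordering::Relaxed);
    }

    /// Cumulative counts per bucket, as Prometheus expects.
    fn cumulative(&self) -> Vec<u64> {
        let mut total = 0u64;
        self.buckets
            .iter()
            .map(|b| {
                total += b.load(Ordering::Relaxed);
                total
            })
            .collect()
    }
}

fn main() {
    let hist = LatencyHistogram::new(&[10, 100, 1000]);
    for ms in [5, 50, 50, 5000] {
        hist.observe(&AuditRecord { duration_ms: ms, success: true });
    }
    // le=10 -> 1, le=100 -> 3, le=1000 -> 3, +Inf -> 4
    assert_eq!(hist.cumulative(), vec![1, 3, 3, 4]);
}
```

The same `observe` hook could increment per-backend and per-identity counters, since the audit record already carries those fields.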


Expected behavior

The proxy should expose runtime metrics in a format consumable by standard monitoring infrastructure (Prometheus, OpenTelemetry, or similar), covering at minimum: request throughput, latency distribution, error rates, and backend pool status.
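With such an endpoint in place, the questions above reduce to standard PromQL, for example (metric names hypothetical, matching the sketches above):

```
# Throughput per backend (question 1)
sum by (backend) (rate(mcp_proxy_tool_calls_total[5m]))

# p95 latency (question 2)
histogram_quantile(0.95,
  sum by (le) (rate(mcp_proxy_tool_call_duration_seconds_bucket[5m])))

# Error rate (question 3)
sum(rate(mcp_proxy_tool_calls_total{outcome="error"}[5m]))
  / sum(rate(mcp_proxy_tool_calls_total[5m]))
```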

Metadata


Labels

enhancement (New feature or improvement), infrastructure (Docker, Kubernetes, deployment), observability (Audit logs, monitoring, debugging), performance (Latency, throughput, resource usage), proxy (Serve/proxy mode (mcp serve))
