Problem
The proxy exposes a `GET /health` endpoint that returns basic status information:

```json
{
  "status": "ok",
  "backends_configured": 3,
  "backends_connected": 2,
  "active_clients": 1,
  "tools": 42,
  "version": "0.4.3"
}
```
This is sufficient for liveness checks, but provides no observability into the proxy's runtime behavior over time. In a production Kubernetes deployment, operators need to answer questions like:
Questions that cannot be answered today
- Throughput — How many tool calls per second is the proxy handling? What's the breakdown by backend? By tool? By identity?
- Latency — What's the p50/p95/p99 latency for tool calls? Which backends are slow? Are latencies increasing over time?
- Error rates — What percentage of tool calls are failing? Is the error rate spiking? Which backends have the highest error rates?
- Backend pool health — How often are backends being connected/disconnected by the idle reaper? How long do backend connections live? How often does lazy reconnection happen?
- Connection management — How many SSE/Streamable HTTP sessions are active? What's the connection churn rate? How often do TCP keepalive timeouts fire?
- Resource usage — How large is the in-memory tool cache? How many child processes are running (for stdio backends)?
- Auth/ACL — How many requests are rejected by ACL? What's the breakdown by identity and denied tool?
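Each of these questions maps onto labeled time series. As a rough sketch (the metric and label names below are hypothetical; the proxy does not emit any of these today), a scrape could return something like:

```text
# HELP mcp_tool_calls_total Tool calls handled, by backend, tool, and outcome.
# TYPE mcp_tool_calls_total counter
mcp_tool_calls_total{backend="github",tool="search_issues",status="ok"} 1432
mcp_tool_calls_total{backend="github",tool="search_issues",status="error"} 7

# HELP mcp_backends_connected Backends currently connected in the pool.
# TYPE mcp_backends_connected gauge
mcp_backends_connected 2
```

With series like these, standard PromQL answers the questions above, e.g. `rate(mcp_tool_calls_total[5m])` for throughput and error-rate breakdowns, or `histogram_quantile(0.95, ...)` over a duration histogram for tail latency.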
Why this matters
The proxy is designed to be shared infrastructure — a single proxy serving multiple AI clients. The `/health` endpoint is a point-in-time snapshot with no history, no aggregation, and no percentiles. Operators relying on it can only tell "is it up right now?" but not "is it degrading?" or "should I scale?"
The standard solution in the Kubernetes ecosystem is a Prometheus-compatible `/metrics` endpoint exposing counters, gauges, and histograms that are scraped by Prometheus/Grafana, Datadog, or similar monitoring stacks.
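To make the histogram piece concrete, here is a minimal, self-contained sketch of how a Prometheus-style latency histogram works: cumulative `le` buckets plus `_sum` and `_count`, rendered in the text exposition format. This is illustrative only; it uses std atomics rather than the `prometheus` crate, and the metric name is invented, not something the proxy defines.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Prometheus-style histogram: per-bucket counters plus _sum and _count.
/// Illustrative sketch only; a real implementation would come from a crate.
struct LatencyHistogram {
    bounds: Vec<f64>,        // bucket upper bounds in seconds, ascending
    buckets: Vec<AtomicU64>, // one counter per bound, plus an implicit +Inf
    sum_micros: AtomicU64,   // total observed time, in microseconds
    count: AtomicU64,        // total number of observations
}

impl LatencyHistogram {
    fn new(bounds: Vec<f64>) -> Self {
        let buckets = (0..bounds.len() + 1).map(|_| AtomicU64::new(0)).collect();
        Self { bounds, buckets, sum_micros: AtomicU64::new(0), count: AtomicU64::new(0) }
    }

    /// Record one observation (e.g. a tool-call duration) in seconds.
    fn observe(&self, secs: f64) {
        // First bucket whose upper bound covers the value; else +Inf.
        let idx = self.bounds.iter().position(|&b| secs <= b).unwrap_or(self.bounds.len());
        self.buckets[idx].fetch_add(1, Ordering::Relaxed);
        self.sum_micros.fetch_add((secs * 1e6) as u64, Ordering::Relaxed);
        self.count.fetch_add(1, Ordering::Relaxed);
    }

    /// Render in the Prometheus text exposition format; bucket counts
    /// are cumulative, as the format requires.
    fn render(&self, name: &str) -> String {
        let mut out = String::new();
        let mut cumulative = 0u64;
        for (i, bound) in self.bounds.iter().enumerate() {
            cumulative += self.buckets[i].load(Ordering::Relaxed);
            out.push_str(&format!("{name}_bucket{{le=\"{bound}\"}} {cumulative}\n"));
        }
        cumulative += self.buckets[self.bounds.len()].load(Ordering::Relaxed);
        out.push_str(&format!("{name}_bucket{{le=\"+Inf\"}} {cumulative}\n"));
        out.push_str(&format!("{name}_sum {}\n",
            self.sum_micros.load(Ordering::Relaxed) as f64 / 1e6));
        out.push_str(&format!("{name}_count {}\n", self.count.load(Ordering::Relaxed)));
        out
    }
}
```

In practice a crate such as `prometheus` or `metrics` provides this machinery; the point is that the proxy only needs to increment a few atomics on each tool call and render them on scrape, so the overhead is negligible.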
Data sources already in the code
The proxy already tracks much of this data internally but doesn't expose it:
- Audit log — records every tool call with duration, success, identity, server, and tool name (`src/audit.rs`)
- Backend pool — tracks connected/configured/idle backends with timestamps (`src/serve.rs`)
- Active clients — maintained as a counter in `AppState` (`src/serve.rs`)
- Tool cache — holds all discovered tools with their server associations (`src/serve.rs`)
- ACL decisions — grant/deny/classify happen on every request (`src/server_auth/`)
- Idle reaper — runs every 30s and logs disconnections (`src/serve.rs`)
The data exists — it's just not exposed in a scrapeable format.
Related issues
Expected behavior
The proxy should expose runtime metrics in a format consumable by standard monitoring infrastructure (Prometheus, OpenTelemetry, or similar), covering at minimum: request throughput, latency distribution, error rates, and backend pool status.