Skip to content

Commit 28f4c74

Browse files
authored
Honor gateway tool execution budgets for HTTP backends by removing hardcoded 120s transport cap (#3911)
HTTP MCP backend `tools/call` requests were effectively capped at ~120s due to a hardcoded `http.Client.Timeout`, even when gateway/tool budgets were configured higher. This made `toolTimeout` non-authoritative for long-running HTTP backend calls and surfaced misleading transport timeout failures. - **Timeout model correction (HTTP backend transport)** - Removed the hardcoded `http.Client.Timeout` from `NewHTTPConnection`. - Kept `connect_timeout` behavior for connection/setup phases (`DialContext`, `ResponseHeaderTimeout`). - End-to-end request duration is now governed by request context deadlines (including gateway `toolTimeout` budgets), not a fixed transport ceiling. - **Behavioral test update** - Updated HTTP connection timeout test to assert that client-level overall timeout is unset (`0`) instead of `120s`, matching context-driven execution timeout semantics. - **Docs clarification** - Clarified that `connect_timeout` is per-attempt transport setup/fallback timeout. - Clarified that `toolTimeout` governs tool execution budget for backend requests, including HTTP backends. ```go httpClient := &http.Client{ Transport: &http.Transport{ DialContext: (&net.Dialer{ Timeout: connectTimeout, }).DialContext, ResponseHeaderTimeout: connectTimeout, }, } ``` > [!WARNING] > > <details> > <summary>Firewall rules blocked me from connecting to one or more addresses (expand for details)</summary> > > #### I tried to connect to the following addresses, but was blocked by firewall rules: > > - `example.com` > - Triggering command: `/tmp/go-build915870139/b514/launcher.test /tmp/go-build915870139/b514/launcher.test -test.testlogfile=/tmp/go-build915870139/b514/testlog.txt -test.paniconexit0 -test.timeout=10m0s .cfg�� ache/go/1.25.8/x-errorsas 64/src/text/temp-ifaceassert x_amd64/vet -p vendor/golang.or-o -lang=go1.25 x_amd64/vet om/s�� .cfg -I x_amd64/vet --gdwarf-5 7082204/b245/ -o x_amd64/vet` (dns block) > - Triggering command: `/tmp/go-build1119629889/b514/launcher.test /tmp/go-build1119629889/b514/launcher.test -test.testlogfile=/tmp/go-build1119629889/b514/testlog.txt -test.paniconexit0 -test.timeout=10m0s EN3 io.containerd.ru/run/containerd/io.containerd.runtime.v2.task/moby/f51a7715e232ec8a9315d8b288126git x_amd64/vet /usr/bin/bash by/ae163377ad462bash` (dns block) > - `invalid-host-that-does-not-exist-12345.com` > - Triggering command: `/tmp/go-build915870139/b496/config.test /tmp/go-build915870139/b496/config.test -test.testlogfile=/tmp/go-build915870139/b496/testlog.txt -test.paniconexit0 -test.timeout=10m0s /tmp/go-build915870139/b396/vet.cfg @v1.1.3/cpu/cpu.go om/tetratelabs/wazero@v1.11.0/internal/platform/google.golang.org/grpc/internal/serviceconfig x_amd64/vet -pthread ternal/engine/wa-atomic =0 x_amd64/vet Db3z�� 7082204/b252/_pk-errorsas 7082204/b151/ x_amd64/vet abis telabs/wazero/in-atomic p=/opt/hostedtoo-bool x_amd64/vet` (dns block) > - `nonexistent.local` > - Triggering command: `/tmp/go-build915870139/b514/launcher.test /tmp/go-build915870139/b514/launcher.test -test.testlogfile=/tmp/go-build915870139/b514/testlog.txt -test.paniconexit0 -test.timeout=10m0s .cfg�� ache/go/1.25.8/x-errorsas 64/src/text/temp-ifaceassert x_amd64/vet -p vendor/golang.or-o -lang=go1.25 x_amd64/vet om/s�� .cfg -I x_amd64/vet --gdwarf-5 7082204/b245/ -o x_amd64/vet` (dns block) > - Triggering command: `/tmp/go-build1119629889/b514/launcher.test /tmp/go-build1119629889/b514/launcher.test -test.testlogfile=/tmp/go-build1119629889/b514/testlog.txt -test.paniconexit0 -test.timeout=10m0s EN3 io.containerd.ru/run/containerd/io.containerd.runtime.v2.task/moby/f51a7715e232ec8a9315d8b288126git x_amd64/vet /usr/bin/bash by/ae163377ad462bash` (dns block) > - `slow.example.com` > - Triggering command: `/tmp/go-build915870139/b514/launcher.test /tmp/go-build915870139/b514/launcher.test -test.testlogfile=/tmp/go-build915870139/b514/testlog.txt -test.paniconexit0 -test.timeout=10m0s .cfg�� ache/go/1.25.8/x-errorsas 64/src/text/temp-ifaceassert x_amd64/vet -p vendor/golang.or-o -lang=go1.25 x_amd64/vet om/s�� .cfg -I x_amd64/vet --gdwarf-5 7082204/b245/ -o x_amd64/vet` (dns block) > - Triggering command: `/tmp/go-build1119629889/b514/launcher.test /tmp/go-build1119629889/b514/launcher.test -test.testlogfile=/tmp/go-build1119629889/b514/testlog.txt -test.paniconexit0 -test.timeout=10m0s EN3 io.containerd.ru/run/containerd/io.containerd.runtime.v2.task/moby/f51a7715e232ec8a9315d8b288126git x_amd64/vet /usr/bin/bash by/ae163377ad462bash` (dns block) > - `this-host-does-not-exist-12345.com` > - Triggering command: `/tmp/go-build915870139/b523/mcp.test /tmp/go-build915870139/b523/mcp.test -test.testlogfile=/tmp/go-build915870139/b523/testlog.txt -test.paniconexit0 -test.timeout=10m0s 64/s�� .cfg -I x_amd64/vet --gdwarf-5 .io/otel -o x_amd64/vet .cfg�� /oidc/errors.go /oidc/provider.go x_amd64/vet . g/grpc/internal/-qE --64 x_amd64/vet` (dns block) > - Triggering command: `/tmp/go-build1119629889/b523/mcp.test /tmp/go-build1119629889/b523/mcp.test -test.testlogfile=/tmp/go-build1119629889/b523/testlog.txt -test.paniconexit0 -test.timeout=10m0s --ve�� =0 y /usr/bin/docker ntime.v2.task/mogit` (dns block) > > If you need me to access, download, or install something from one of these locations, you can either: > > - Configure [Actions setup steps](https://gh.io/copilot/actions-setup-steps) to set up my environment, which run before the firewall is enabled > - Add the appropriate URLs or hosts to the custom allowlist in this repository's [Copilot coding agent settings](https://github.com/github/gh-aw-mcpg/settings/copilot/coding_agent) (admins only) > > </details>
2 parents 85562b1 + 32b8c76 commit 28f4c74

6 files changed

Lines changed: 96 additions & 17 deletions

File tree

docs/CONFIGURATION.md

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -172,7 +172,7 @@ Run `./awmg --help` for full CLI options. Key flags:
172172
- Enables per-server DIFC guard assignment independent of `guard-policies`
173173
- Example: `guard = "github"` (uses the guard named `github` from `[guards.github]`)
174174

175-
- **`connect_timeout`** (optional, HTTP servers only): Per-transport connection timeout in seconds for connecting to HTTP backends. The gateway tries streamable HTTP, then SSE, then plain JSON-RPC over HTTP POST in sequence; this timeout applies to each attempt. Default: `30`.
175+
- **`connect_timeout`** (optional, HTTP servers only): Per-transport connection timeout in seconds for connecting to HTTP backends. The gateway tries streamable HTTP, then SSE, then plain JSON-RPC over HTTP POST in sequence; this timeout applies to each attempt. It does **not** set the end-to-end `tools/call` execution timeout. Default: `30`.
176176

177177
- **`rate_limit_threshold`** (optional, TOML/JSON file configs only): Number of consecutive rate-limit errors from this backend that will trip the circuit breaker (transition CLOSED → OPEN). When OPEN, requests are immediately rejected until the breaker is eligible to transition to HALF-OPEN again; this is normally controlled by `rate_limit_cooldown`, but if the gateway knows an upstream rate-limit reset time (for example from response headers or parsed tool error text), that reset time takes precedence. **Not available in JSON stdin format.** Default: `3`.
178178

@@ -372,7 +372,7 @@ The `customSchemas` top-level field allows you to define custom server types bey
372372
| `apiKey` | API key for authentication | (disabled) |
373373
| `domain` | Gateway domain (`"localhost"`, `"host.docker.internal"`, or `"${VAR}"`) | (unset) |
374374
| `startupTimeout` | Seconds to wait for backend startup | `30` |
375-
| `toolTimeout` | Seconds to wait for tool execution | `60` |
375+
| `toolTimeout` | Maximum seconds for a single tool call, enforced as a context deadline on all backend requests (stdio and HTTP) | `60` |
376376
| `payloadDir` | Directory for large payload files | `/tmp/jq-payloads` |
377377
| `payloadSizeThreshold` (JSON) / `payload_size_threshold` (TOML) | Size threshold in bytes; responses larger than this are stored to disk and returned as a `payloadPath` reference | `524288` (512 KB) |
378378
| `trustedBots` (JSON) / `trusted_bots` (TOML) | Optional list of additional bot usernames to trust with "approved" integrity level. Additive to the built-in trusted bot list. When specified, must be a non-empty array with non-empty string entries (spec §4.1.3.4); omit the field entirely if not needed. Example: `["my-bot[bot]", "org-automation"]` | (disabled) |

internal/mcp/connection.go

Lines changed: 4 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -204,13 +204,11 @@ func NewHTTPConnection(ctx context.Context, serverID, url string, headers map[st
204204
logger.LogInfo("backend", "Creating HTTP MCP connection with transport fallback, url=%s, connectTimeout=%v", url, connectTimeout)
205205
ctx, cancel := context.WithCancel(ctx)
206206

207-
// Create an HTTP client with appropriate timeouts.
208-
// Keep the existing overall request timeout, but also apply connectTimeout to
209-
// the underlying HTTP transport so plain JSON-RPC fallback attempts honor the
210-
// configured per-attempt connection timeout instead of waiting for the full
211-
// client timeout.
207+
// Create an HTTP client with connect/setup timeouts.
208+
// Do not set http.Client.Timeout: request execution should be controlled by
209+
// per-request context deadlines (for example the gateway tool timeout budget),
210+
// while connectTimeout continues to bound transport establishment/fallback.
212211
httpClient := &http.Client{
213-
Timeout: 120 * time.Second, // Overall request timeout
214212
Transport: &http.Transport{
215213
DialContext: (&net.Dialer{
216214
Timeout: connectTimeout,

internal/mcp/http_connection_test.go

Lines changed: 7 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -483,13 +483,13 @@ func TestNewHTTPConnection_NilHeaders(t *testing.T) {
483483
assert.Equal(t, testServer.URL, conn.GetHTTPURL())
484484
}
485485

486-
// TestNewHTTPConnection_HTTPClientTimeout tests that HTTP client timeout is properly configured
487-
func TestNewHTTPConnection_HTTPClientTimeout(t *testing.T) {
486+
// TestNewHTTPConnection_HTTPClientTimeoutUnset tests that overall client timeout is unset
487+
// so tool execution budgets are controlled by request context deadlines.
488+
func TestNewHTTPConnection_HTTPClientTimeoutUnset(t *testing.T) {
488489
require := require.New(t)
489490

490-
// Create test server with delayed response (but not too long to hit default 120s timeout)
491+
// Create test server with delayed response.
491492
testServer := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
492-
// Small delay to verify timeout handling works
493493
time.Sleep(50 * time.Millisecond)
494494

495495
response := map[string]interface{}{
@@ -516,8 +516,9 @@ func TestNewHTTPConnection_HTTPClientTimeout(t *testing.T) {
516516
require.NotNil(conn)
517517
defer conn.Close()
518518

519-
// Verify HTTP client has proper timeout set
520-
assert.Equal(t, 120*time.Second, conn.httpClient.Timeout, "HTTP client should have 120s timeout")
519+
// Verify overall HTTP client timeout is unset. End-to-end request budgets
520+
// should come from context deadlines, not a hardcoded client timeout.
521+
assert.Zero(t, conn.httpClient.Timeout)
521522
}
522523

523524
// TestNewHTTPConnection_ConnectionRefused tests handling of connection refused errors

internal/server/call_backend_tool_test.go

Lines changed: 70 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -7,6 +7,7 @@ import (
77
"net/http/httptest"
88
"strings"
99
"testing"
10+
"time"
1011

1112
"github.com/github/gh-aw-mcpg/internal/config"
1213
sdk "github.com/modelcontextprotocol/go-sdk/mcp"
@@ -484,3 +485,72 @@ func TestCallBackendTool_AllowedToolsError_MessageFormat(t *testing.T) {
484485
assert.True(t, strings.Contains(text, `"blocked"`), "error message should include tool name: %s", text)
485486
assert.True(t, strings.Contains(text, "allowed-tools"), "error message should mention allowed-tools: %s", text)
486487
}
488+
489+
// TestCallBackendTool_ToolTimeoutEnforcedViaContext verifies that the configured
490+
// toolTimeout is applied as a context deadline, causing slow backend calls to fail
491+
// with a deadline-exceeded error instead of hanging indefinitely.
492+
func TestCallBackendTool_ToolTimeoutEnforcedViaContext(t *testing.T) {
493+
require := require.New(t)
494+
495+
// Create a slow backend that delays longer than our tool timeout
496+
backend := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
497+
var req map[string]interface{}
498+
if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
499+
w.WriteHeader(http.StatusBadRequest)
500+
return
501+
}
502+
503+
method, _ := req["method"].(string)
504+
switch method {
505+
case "initialize":
506+
json.NewEncoder(w).Encode(map[string]interface{}{
507+
"jsonrpc": "2.0", "id": req["id"],
508+
"result": map[string]interface{}{
509+
"protocolVersion": "2024-11-05",
510+
"capabilities": map[string]interface{}{},
511+
"serverInfo": map[string]interface{}{"name": "slow-backend", "version": "1.0.0"},
512+
},
513+
})
514+
case "tools/list":
515+
json.NewEncoder(w).Encode(map[string]interface{}{
516+
"jsonrpc": "2.0", "id": req["id"],
517+
"result": map[string]interface{}{"tools": []map[string]interface{}{}},
518+
})
519+
case "tools/call":
520+
// Simulate a slow tool: sleep longer than the configured toolTimeout.
521+
// The actual tool call should return after ~1s (timeout), but the
522+
// httptest.Server cleanup waits for this goroutine to finish.
523+
time.Sleep(3 * time.Second)
524+
json.NewEncoder(w).Encode(map[string]interface{}{
525+
"jsonrpc": "2.0", "id": req["id"],
526+
"result": map[string]interface{}{
527+
"content": []map[string]interface{}{{"type": "text", "text": "should not reach here"}},
528+
},
529+
})
530+
}
531+
}))
532+
defer backend.Close()
533+
534+
// Configure with a very short toolTimeout (1 second)
535+
cfg := &config.Config{
536+
Gateway: &config.GatewayConfig{
537+
ToolTimeout: 1,
538+
},
539+
Servers: map[string]*config.ServerConfig{
540+
"slow": {Type: "http", URL: backend.URL},
541+
},
542+
}
543+
544+
us, err := NewUnified(context.Background(), cfg)
545+
require.NoError(err)
546+
defer us.Close()
547+
548+
ctx := context.WithValue(context.Background(), SessionIDContextKey, "timeout-test")
549+
result, _, callErr := us.callBackendTool(ctx, "slow", "slow_tool", map[string]interface{}{})
550+
551+
// The call should fail due to context deadline exceeded
552+
require.Error(callErr, "Tool call should fail due to timeout")
553+
require.NotNil(result, "Should return a CallToolResult even on timeout")
554+
require.True(result.IsError, "Result should be marked as error")
555+
t.Logf("Tool call correctly timed out: %v", callErr)
556+
}

internal/server/unified.go

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -466,6 +466,16 @@ func (us *UnifiedServer) callBackendTool(ctx context.Context, serverID, toolName
466466
// The closure captures the request and validates before calling this method
467467
logUnified.Printf("callBackendTool: serverID=%s, toolName=%s, args=%+v", serverID, toolName, args)
468468

469+
// Apply the configured tool timeout as a context deadline so backend calls
470+
// (including HTTP backends) are bounded by toolTimeout rather than hanging
471+
// indefinitely. This is the primary enforcement point for the gateway's
472+
// tool execution budget.
473+
if us.cfg != nil && us.cfg.Gateway != nil && us.cfg.Gateway.ToolTimeout > 0 {
474+
var cancel context.CancelFunc
475+
ctx, cancel = context.WithTimeout(ctx, time.Duration(us.cfg.Gateway.ToolTimeout)*time.Second)
476+
defer cancel()
477+
}
478+
469479
// Start an OTEL span for the full tool call lifecycle (spans all phases 0–6)
470480
// Attribute names follow MCP Gateway Specification §4.1.3.6
471481
ctx, toolSpan := us.getTracer().Start(ctx, "mcp.tool_call",

test/integration/http_error_test.go

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -79,10 +79,10 @@ func TestHTTPError_ClientError(t *testing.T) {
7979
}
8080

8181
// TestHTTPError_ConnectionTimeout tests that connection timeouts are properly handled
82-
// This test is SKIPPED by default because it takes a long time (the HTTP client has a 120s timeout).
82+
// This test is SKIPPED by default because it takes a long time.
8383
// The timeout behavior is tested implicitly in other tests that use context timeouts.
8484
func TestHTTPError_ConnectionTimeout(t *testing.T) {
85-
t.Skip("Skipping timeout test - takes too long due to HTTP client 120s timeout. Other tests verify context-based timeouts work correctly.")
85+
t.Skip("Skipping timeout test - takes too long. Other tests verify context-based timeouts work correctly.")
8686

8787
if testing.Short() {
8888
t.Skip("Skipping integration test in short mode")
@@ -95,7 +95,7 @@ func TestHTTPError_ConnectionTimeout(t *testing.T) {
9595
if f, ok := w.(http.Flusher); ok {
9696
f.Flush()
9797
}
98-
// Sleep to cause a read timeout (HTTP client has 120s timeout)
98+
// Sleep to cause a read timeout
9999
// But we'll use a shorter context timeout to cut it off sooner
100100
time.Sleep(150 * time.Second)
101101
}))

0 commit comments

Comments
 (0)