
Conversation

@smoogipoo (Contributor) commented Nov 8, 2025

RFC

Resolves #35580
Resolves ppy/osu-server-spectator#193
Resolves #35586
Resolves ppy/osu-server-spectator#362

Outline

This enables stateful reconnect for spectator-server endpoints, allowing ConnectionIds to be preserved for a short period and messages to be replayed on reconnect.

In practice, this means short disconnects (<30s) should no longer:

  • Drop replays
  • Kick you out of multiplayer rooms
  • Trigger "user has come online" re-alerts

The following video demonstrates two of the above:

2025-11-08.19-30-17.mp4

Stateful reconnect appears to kick in only as long as the socket doesn't get fully disconnected; it doesn't apply to subsequent re-connections. We have the timeout period left at SignalR's default of 30 seconds.
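
For reference (not shown in this diff), the server side also has to opt each hub endpoint into stateful reconnect when it's mapped. A minimal sketch of that opt-in, assuming ASP.NET Core 8's AllowStatefulReconnects option (the route below is a placeholder, not the actual osu-server-spectator mapping):

// Hypothetical endpoint mapping; route is a placeholder.
app.UseEndpoints(endpoints =>
{
    endpoints.MapHub<SpectatorHub>("/spectator", options =>
    {
        // Opt this endpoint into stateful reconnect (ASP.NET Core 8+).
        options.AllowStatefulReconnects = true;
    });
});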

I've been using the following to simulate a total link loss:

#!/bin/bash

# Simulates a total link loss against the locally-hosted server:
# takes the loopback interface down, force-closes existing sockets,
# waits for the requested delay, then brings the interface back up.

DELAY=${1:-1}

echo "Conditioning for $DELAY seconds..."

sudo ip link set lo down
# Kill any established connections to the local test server address.
sudo ss -K dst 127.0.0.2 > /dev/null
sleep "$DELAY"
sudo ip link set lo up

@smoogipoo smoogipoo added the area:online functionality label Nov 8, 2025
@smoogipoo smoogipoo closed this Nov 8, 2025
@smoogipoo smoogipoo reopened this Nov 8, 2025
@smoogipoo smoogipoo self-assigned this Nov 8, 2025
@smoogipoo smoogipoo requested a review from a team November 8, 2025 11:40
@smoogipoo (Contributor, Author) commented Nov 12, 2025

On Discord I was asked why the connection dies after 30 seconds.

Initially I thought this was just the socket keepalive period, but that appears to be set to 15 seconds. It turns out there's a second timeout, which is the 30s window we're concerned about, and it can be adjusted with the following:

diff --git a/osu.Server.Spectator/Startup.cs b/osu.Server.Spectator/Startup.cs
index 3e326cc..028afd2 100644
--- a/osu.Server.Spectator/Startup.cs
+++ b/osu.Server.Spectator/Startup.cs
@@ -29,6 +29,7 @@ namespace osu.Server.Spectator
                     {
                         options.AddFilter<LoggingHubFilter>();
                         options.AddFilter<ConcurrentConnectionLimiter>();
+                        options.ClientTimeoutInterval = TimeSpan.FromMinutes(5);
                     })
                     .AddMessagePackProtocol(options =>
                     {
diff --git a/osu.Game/Online/HubClientConnector.cs b/osu.Game/Online/HubClientConnector.cs
index ff9a4261fd..c87ba0812c 100644
--- a/osu.Game/Online/HubClientConnector.cs
+++ b/osu.Game/Online/HubClientConnector.cs
@@ -72,6 +72,7 @@ protected override Task<PersistentEndpointClient> BuildConnectionAsync(Cancellat
                     options.Headers.Add(CLIENT_SESSION_ID_HEADER, API.SessionIdentifier.ToString());
                 });
 
+            builder.WithServerTimeout(TimeSpan.FromMinutes(5));
             builder.WithStatefulReconnect();
 
             builder.AddMessagePackProtocol(options =>

I haven't re-tested, but I've been able to go up to 1 minute before. I don't know the implications.

@bdach (Collaborator) commented Nov 12, 2025

Some empirical observations from local testing of what this "stateful reconnect" feature actually does, which I gathered myself because the docs are terrible.

Conditions of test:

  • Separate SignalR project with 2 disparate hubs, with 2 operations in each hub
  • Client and server run on different PCs on a local network
  • Link trouble & link loss simulated on the client PC via Network Link Conditioner
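
(For illustration, a minimal reconstruction of such a harness is sketched below; all hub, method, and URL names are hypothetical, not taken from the actual test project.)

// --- Server (Hubs.cs): two disparate hubs with a couple of trivial operations each.
// --- Assumed to be mapped at /first and /second in the server's routing (not shown).
using System;
using System.Threading.Tasks;
using Microsoft.AspNetCore.SignalR;

public class FirstHub : Hub
{
    public Task<int> Echo(int value) => Task.FromResult(value);

    public Task Notify(string message) => Clients.Others.SendAsync("Notified", message);
}

public class SecondHub : Hub
{
    public Task<long> Now() => Task.FromResult(DateTimeOffset.UtcNow.ToUnixTimeMilliseconds());

    public Task Ping() => Clients.Caller.SendAsync("Pong");
}

// --- Client (Program.cs): stateful reconnect opted in, invocation latency observed
// --- while the link is degraded with Network Link Conditioner.
using System;
using System.Diagnostics;
using Microsoft.AspNetCore.SignalR.Client;

var connection = new HubConnectionBuilder()
                 .WithUrl("http://server-pc:5000/first") // placeholder URL
                 .WithStatefulReconnect()
                 .Build();

await connection.StartAsync();

var stopwatch = Stopwatch.StartNew();

// During a short link loss this call simply blocks, then completes once the
// buffered invocation is replayed over the stateful reconnect.
await connection.InvokeAsync<int>("Echo", 1);

Console.WriteLine($"Echo acknowledged after {stopwatch.ElapsedMilliseconds} ms");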

Conclusions:

  • Async invocations of hub methods appear to block until acknowledgement is received from the server. Once reconnection occurs and the invocations succeed server-side, the hub methods unblock and program execution continues.

    • Tested via full loss of link.
  • Order of invocations appears to be guaranteed at hub level. Invocations from a single client instance that span hub boundaries are not guaranteed to occur in the same order.

    • Tested via induced 50% packet loss. Order of messages was not preserved between client and server in general, but messages within a single hub were kept in the same order.
  • The reconnection works and preserves connection IDs as long as the original websocket doesn't go dead due to one of the relevant keepalive periods.

    • Without stateful reconnection, once a "The remote party closed the WebSocket connection without completing the close handshake" error occurs (about 15s in), all subsequent SignalR operations fail instantly and can essentially be considered dropped.
    • Without stateful reconnection but with automatic reconnection, some time after the above, SignalR operations go from instant-fail to fully blocking, block for a couple of seconds, and then go back to instant-fail. I'm not sure what causes this, but my hypothesis is that it's linked to the retry policy determining that it's time to retry connecting again.
    • With both stateful and automatic reconnection, some messages are still dropped. Client operations fully block during the connection failure, but some then fail after the failure is resolved, due to the 30s server timeout.
    • Regardless of stateful reconnection, if some keepalive period expires (either the socket keepalive or the client timeout), then once connectivity resumes the client re-establishes the connection, but with a new set of connection IDs.
  • Both client and server use message buffers for this stateful reconnection. The size of this buffer is configurable on both sides, but it seemingly isn't easily instrumentable to check how heavily it is being utilised at any given time.

    I haven't been able to empirically exercise the effects of overrunning this buffer very well, except for noticing that when I set it to an obscenely low amount like a byte, the client stops doing anything. That would check out with the implementation of this buffer that I found in the ASP.NET source, which says that "primitive backpressure" (i.e. fully blocking the relevant message until enough space is reclaimed or the connection is dropped) is utilised. The relevant knobs are sketched below.
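
For completeness, the buffer-size knobs in question live in the following places, assuming ASP.NET Core 8's stateful reconnect options; the values shown are purely illustrative (they are neither the framework defaults nor anything this PR ships):

// Server side (inside ConfigureServices): global maximum buffer size, in bytes.
services.AddSignalR(options =>
{
    options.StatefulReconnectBufferSize = 200_000;
});

// Client side: opt in, then size the buffer via the underlying HTTP connection options.
// Requires Microsoft.AspNetCore.SignalR.Client and Microsoft.AspNetCore.Http.Connections.Client.
var builder = new HubConnectionBuilder()
              .WithUrl("http://hub-url") // placeholder
              .WithStatefulReconnect();

builder.Services.Configure<HttpConnectionOptions>(options => options.StatefulReconnectBufferSize = 200_000);

var connection = builder.Build();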

Long and short, I'm not sure spending more time on investigation here is useful at this stage; I'll reconsider that tomorrow. My immediate vibes on this are as follows:

  • This is probably better than the nothing that we have, but probably won't be as good as we'd hope it to be either
  • There are knobs we can tweak here, to some effect, that may improve how this works
  • There's also a giant danger sign on how primitive the buffer blocking logic looks, which may cause very bad blockages especially server-side (if there's anything else I'd want to investigate further it's this).


Labels

area:online functionality, size/XS
