
Conversation

@mkeeter (Contributor) commented May 22, 2025

There's a weird phantom state that's been hiding in the Upstairs state machine; this PR makes it explicit.

In the handler for YouAreNoLongerActive and on_uuid_mismatch, the Upstairs performs a peculiar ritual:

        // Restart the state machine for this downstairs client
        self.downstairs.clients[client_id].disable(&self.state);
        self.set_inactive(CrucibleError::UuidMismatch);

The effects here are somewhat confusing:

  • Calling DownstairsClient::disable stops a client with ClientStopReason::Disabled. This stop reason is a special case – it means that the client does not try to reconnect when reinitialized. In all other cases, whether the client connects depends on the upstairs state alone.
  • Upstairs::set_inactive sets the upstairs state to Initializing. It does not do anything else – for example, it doesn't try to stop the other Downstairs.

The end result is that the problematic client is restarted, and does not connect to the Downstairs.

As we rethink the state machine for RFD 542, I'd like to remove this special case. In this PR:

  • Upstairs::set_inactive is renamed to Upstairs::set_disabled and now sets the upstairs state to a new UpstairsState::Disabled state
  • auto_promote: bool is removed from the negotiation state, because we now only depend on the upstairs state

I still think the semantics of the UpstairsState::Disabled are fuzzy and could use some ironing out, but this PR is meant to be a step in the right direction.

For example, going straight to Initializing without shutting down the other Downstairs seems bad! Once the upstairs is in Initializing, it will accept a GoActive request, which will hit this panic on the other Downstairs. This issue remains true after the PR (although the Upstairs will be in Disabled instead).
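
For concreteness, here's a minimal, hedged sketch of the shape of the change. The types below are simplified stand-ins; only the UpstairsState::Disabled variant and the rename from set_inactive to set_disabled come from this PR, everything else is illustrative.

    // Simplified stand-in types, not the real crucible code.
    #[derive(Debug)]
    enum UpstairsState {
        Initializing,
        Active,
        /// New in this PR: an unrecoverable condition (e.g. a UUID
        /// mismatch) was hit, and clients will not auto-reconnect.
        Disabled,
    }

    struct Upstairs {
        state: UpstairsState,
    }

    impl Upstairs {
        /// Formerly `set_inactive`, which reset the state to
        /// `Initializing`; now the condition is recorded explicitly.
        fn set_disabled(&mut self) {
            self.state = UpstairsState::Disabled;
        }
    }

    fn main() {
        let mut up = Upstairs { state: UpstairsState::Active };
        up.set_disabled();
        println!("upstairs is now {:?}", up.state);
    }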

@leftwo (Contributor) commented May 22, 2025

The end result is that the problematic client is restarted, and does not connect to the Downstairs.

So, I believe this path was to handle what happens when we have a bad set of targets for the downstairs.
We don't believe the downstairs we have connected to is correct, so we don't want to keep trying to connect to it, but we also want to keep the upstairs running to allow for a downstairs replacement to come in and "fix" it.

At least that was the idea, and it seemed better than just panicking the upstairs.

We did actually hit this scenario, where a ROP (read-only parent) had finished scrubbing but was still "attached" to the downstairs (a bug that has since been fixed). The actual ROP was eventually deleted, then the port numbers were re-used, and this long-running upstairs tried to reconnect (as the new downstairs came online).

The UUID mismatch is one check, but if a downstairs has different region info, that would (should) take the same path here and result in the same end state, whatever we decide that state should be.

My feeling is this condition means something has gone terribly wrong somewhere, and I do think it's better to just hang and require operator intervention instead of either moving forward or panicking. Does that seem like the right idea?

@mkeeter (Contributor, Author) commented May 22, 2025

My feeling is this condition means something has gone terribly wrong somewhere, and I do think it's better to just hang and require operator intervention instead of either moving forward or panicking. Does that seem like the right idea?

Yup, that sounds reasonable to me!

Do you think having an UpstairsState::Disabled to represent this "terribly wrong" state makes sense? If so, that suggests it should not respond to GoActive requests. I'm not sure how we want to recover; we could add a new API that the agent could hit? We probably don't want to require a full restart, since the upstairs is attached to the Propolis VM.

@leftwo (Contributor) commented May 22, 2025

Do you think having an UpstairsState::Disabled to represent this "terribly wrong" state makes sense? If so, that suggests it should not respond to GoActive requests. I'm not sure how we want to recover; we could add a new API that the agent could hit? We probably don't want to require a full restart, since the upstairs is attached to the Propolis VM.

I don't love the name, but I'm also fine with it. I also don't have any other suggestions.
I do think that, if you find the upstairs in this state, then yeah it should not respond to GoActive requests.

Maybe the path forward would be to replace the Disabled downstairs from the control plane; we could do that without a VM restart. But, really, if we are in this state things are bad, and by blocking GoActive requests, we have essentially prevented the VM from booting (which is probably good, as we don't trust our downstairs targets).

@mkeeter (Contributor, Author) commented May 22, 2025

A few specific comments:

  • Disabled is not a Downstairs state; it's a newly-added variant in the UpstairsState enum, so it applies to the system as a whole (not to a specific downstairs). The problematic Downstairs will be hanging out in NegotiationState::Start, waiting for the connection one-shot to fire (which will never happen).
  • This could technically happen after boot – once we connect to a Downstairs, getting kicked out with YouAreNoLongerActive could happen at any time, which causes us to enter this state

One option would be to make this a downstairs state, and not have UpstairsState::Disabled.

This discussion has helped me clarify the vibes of what's going on here: certain failure modes mean that a downstairs should not try to reconnect, basically dropping into a fail-safe mode where it hangs out and waits for further instruction.
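
To make "hangs out and waits for further instruction" concrete, here's a minimal sketch of the connection one-shot gate from the first bullet above, assuming a tokio-style oneshot; the task shape and names are illustrative, not the real crucible client task.

    // Sketch only: requires the tokio crate ("sync", "rt-multi-thread",
    // and "macros" features).
    use tokio::sync::oneshot;

    // Stand-in for the client IO task: it may not connect until the
    // one-shot fires. If the sender is never fired (e.g. the client was
    // disabled), it never connects and just sits idle.
    async fn client_io_task(connect_rx: oneshot::Receiver<()>) {
        match connect_rx.await {
            Ok(()) => println!("permission granted; connecting to downstairs"),
            Err(_) => println!("never told to connect; staying idle"),
        }
    }

    #[tokio::main]
    async fn main() {
        let (connect_tx, connect_rx) = oneshot::channel();
        let task = tokio::spawn(client_io_task(connect_rx));

        // Dropping the sender without firing it models the fail-safe mode:
        // the disabled client is never granted permission to connect.
        drop(connect_tx);
        task.await.unwrap();
    }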

Follow-up questions:

  • What failures should be classified as "drop into fail-safe mode and stop trying to connect"?
  • What should happen to the upstairs if a downstairs triggers fail-safe mode?
    • What should it do with the other downstairs?
  • How should we recover from fail-safe mode?

I'm not sure if anyone knows the answers yet, but curious to hear everyone's thoughts.

@leftwo (Contributor) commented May 23, 2025

So, I'll also think about this overnight, but

Disabled is not a Downstairs state ..

Ah, okay. I think this is still okay. If one of our downstairs is wrong in a bad way, then, yeah, stopping the upstairs seems like our best choice of all the bad choices we have.

This could technically happen after boot ..

Yes, and if we do live migration, this is what would happen. In that situation we need to consider how we could allow an upstairs that got the YouAreNoLongerActive because of a migration, when that migration then failed and we are trying to put the original pieces back together. Is there a path by which we could do that?

I'm still thinking about the follow-up questions; I'm going to sleep on those and see what the morning brings me :)

@jmpesp (Contributor) commented May 23, 2025

In that situation we need to consider how we could allow an upstairs that got the YouAreNoLongerActive because of a migration, when that migration then failed and we are trying to put the original pieces back together. Is there a path by which we could do that?

In the case of live migration, the destination propolis would have received a new volume checkout, bumping the gen numbers. The source propolis' upstairs kinda needs to be scrapped after this - any downstairs replacement won't fix it.

@jmpesp (Contributor) commented May 23, 2025

Disabled may not carry enough information. We may want to separate out the different failure modes here:

  • CannotActivate when we see mismatched downstairs region info for example
  • KickedOut when we see YouAreNoLongerActive from any of the downstairs.

Both of those are kinda terminal states. Disabled is a bit vague, and I'm not sure we support the idea of Upstairs hanging around doing nothing, apart from before the GoActive is received?

@leftwo (Contributor) commented May 23, 2025

In that situation we need to consider how we could allow an upstairs that got the YouAreNoLongerActive because of a migration, when that migration then failed and we are trying to put the original pieces back together. Is there a path by which we could do that?

In the case of live migration, the destination propolis would have received a new volume checkout, bumping the gen numbers. The source propolis' upstairs kinda needs to be scrapped after this - any downstairs replacement won't fix it.

The source propolis would need a new volume checkout and a new activation if it wants to "take back" the downstairs. You are right that a downstairs replacement would not solve anything here :)

@leftwo (Contributor) commented May 23, 2025

Disabled may not carry enough information. We may want to separate out the different failure modes here:

  • CannotActivate when we see mismatched downstairs region info for example
  • KickedOut when we see YouAreNoLongerActive from any of the downstairs.

Both of those are kinda terminal states. Disabled is a bit vague, and I'm not sure we support the idea of Upstairs hanging around doing nothing, apart from before the GoActive is received?

We do have a BlockOp::Deactivate that will result in the Upstairs disconnecting from the downstairs and then just sitting there. However, we only use that in tests; Propolis never calls it. In theory the upstairs could self-deactivate and then sit there waiting to either be taken down or to get a new activation. This is also tricky, as a running upstairs has targets and will expect them to stay the same; the re-activation can only provide an updated generation number and not change anything else.

@leftwo (Contributor) commented May 23, 2025

Follow-up questions:

  • What failures should be classified as "drop into fail-safe mode and stop trying to connect"?

My current list of things that a reconnect to the same downstairs won't fix:

  • Generation number too low
  • Expected downstairs UUID mismatch
  • Any RegionInfo mismatch (block size, extents, etc.)
  • Failure to complete reconciliation (never actually seen this, not sure how it would even happen)
  • Incompatible version (not yet, but maybe someday :) )
  • Encryption mismatch (not expected to ever happen in production)
  • Read-only mismatch

And, if things were connected and the upstairs was activated and some other upstairs took over, the current upstairs would get kicked out, then attempt to reconnect (which I think could be okay), and then be denied because a higher generation number upstairs has connected.
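
As a hedged illustration of that classification, here's a small sketch; the failure reasons come from the list above, but the enum and helper are illustrative and not crucible's actual ClientStopReason handling.

    // Illustrative only: reasons a connection attempt or negotiation
    // can fail, split into "retry" vs. "drop into fail-safe mode".
    #[derive(Debug)]
    enum NegotiationFailure {
        GenerationTooLow,
        UuidMismatch,
        RegionInfoMismatch, // block size, extent count, etc.
        ReconciliationFailed,
        IncompatibleVersion,
        EncryptionMismatch,
        ReadOnlyMismatch,
        ConnectionTimeout,
    }

    /// Returns true if reconnecting to the same downstairs cannot help,
    /// i.e. the upstairs should drop into fail-safe mode instead of retrying.
    fn is_unrecoverable(failure: &NegotiationFailure) -> bool {
        match failure {
            NegotiationFailure::GenerationTooLow
            | NegotiationFailure::UuidMismatch
            | NegotiationFailure::RegionInfoMismatch
            | NegotiationFailure::ReconciliationFailed
            | NegotiationFailure::IncompatibleVersion
            | NegotiationFailure::EncryptionMismatch
            | NegotiationFailure::ReadOnlyMismatch => true,
            // A plain connection failure is worth retrying.
            NegotiationFailure::ConnectionTimeout => false,
        }
    }

    fn main() {
        let f = NegotiationFailure::UuidMismatch;
        println!("{:?} is unrecoverable: {}", f, is_unrecoverable(&f));
    }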

  • What should happen to the upstairs if a downstairs triggers fail-safe mode?

    • What should it do with the other downstairs?

The negotiation phase should determine what happens. The only place (I believe) that a downstairs would trigger something would be if another upstairs took over.

So, what I think should happen is this.

  • If the upstairs is not yet active, then it gives up and does not activate. It can return an error if an activation request has been sent, and then it just sits there (at least for now).
  • How should we recover from fail-safe mode?

We include James' phone number in the error message.

Seriously, I guess it depends. If we fail on startup before activation, then either the VCR (crucible opts) are bad, or we have some rogue downstairs running on the wrong port. It would most likely be a bug in the software, so recovery may not be an option. My first thought is to hang forever, but I wonder if it would be better to panic and trust that we have left behind enough logs to sift through afterwards.

If we have already activated, and one of the downstairs pulls the plug, then I think the other two should keep going until either we get another downstairs that pulls the plug, or someone sends the upstairs a deactivation request. Once we have two downstairs that have opted out, the upstairs is going to hang on the next flush, though letting the final remaining downstairs get the writes and the flush that are in flight is probably the right thing to do here. And, given that we could be migrating, getting all the data out of the upstairs is what we want.

@mkeeter force-pushed the mkeeter/add-disabled-state branch from d4d8223 to e76f248 on June 10, 2025 at 14:31.
@mkeeter (Contributor, Author) commented Jun 10, 2025

I've done some thinking and made a few more changes based on the discussion above.

Here are the semantics I've settled on for UpstairsState::Disabled:

  • It can be entered through the following paths:
    • Client negotiation failed (IncompatibleSession or IncompatibleSettings)
    • Collation failed during negotiation
    • Getting NoLongerActive from a Downstairs
    • Getting UuidMismatch from a Downstairs
  • When we enter this state
    • All three Downstairs clients are marked as disabled
      • Their client IO task is stopped
      • When it automatically restarts, it will not try to connect
    • Any pending activation request is replied with an error
  • The Upstairs can be reinitialized by sending a fresh activation request
  • The Upstairs can't be deactivated from this state, because it's not active

The major difference between this and @leftwo's suggestions is that all three clients are stopped when we disable the upstairs (for any reason). Alan suggested continuing to run while > 1 Downstairs were active, but Downstairs running while the Upstairs is disabled makes me nervous – it seems like an edge case generator, and we could theoretically limp along in this state indefinitely.

Disabling all three Downstairs immediately puts us into a known, well-behaved state. The downside is that some IO may be dropped, but it doesn't violate any invariants.

The UpstairsState::Disabled variant now also includes a CrucibleError, so we can see why we're in that state.
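
A hedged sketch of those semantics, using simplified stand-in types; only the overall behavior described above (disable every client, fail a pending activation, record the error in the state) is modeled here, and the type shapes are stand-ins even where the names match the thread.

    // Simplified stand-ins so the example is self-contained.
    #[derive(Debug)]
    struct CrucibleError(String);

    #[derive(Debug)]
    enum UpstairsState {
        Initializing,
        GoActive, // an activation request is pending
        Active,
        Disabled(CrucibleError),
    }

    struct DownstairsClient {
        enabled: bool,
    }

    struct Upstairs {
        state: UpstairsState,
        clients: [DownstairsClient; 3],
    }

    impl Upstairs {
        /// Disable the whole upstairs: mark all three clients as disabled
        /// (the real code stops each client IO task, which restarts but
        /// will not reconnect) and fail any pending activation request.
        fn set_disabled(&mut self, err: CrucibleError) {
            for c in self.clients.iter_mut() {
                c.enabled = false;
            }
            if matches!(self.state, UpstairsState::GoActive) {
                // The real code replies to the waiting activation request
                // with an error here.
                println!("activation failed: {:?}", err);
            }
            self.state = UpstairsState::Disabled(err);
        }
    }

    fn main() {
        let mut up = Upstairs {
            state: UpstairsState::GoActive,
            clients: [
                DownstairsClient { enabled: true },
                DownstairsClient { enabled: true },
                DownstairsClient { enabled: true },
            ],
        };
        up.set_disabled(CrucibleError("UUID mismatch".to_string()));
        println!("{:?}", up.state);
    }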

@leftwo (Contributor) left a comment:

I tried this out with a modified downstairs that would do slow IOs, just to see how it handled it, and no issues there.

While I do like the idea of keeping 2/3 downstairs going if the 3rd has left, I see how it can make the upstairs state confusing, and the current behavior is already to deactivate the upstairs when a single downstairs encounters an unrecoverable error, so it's not like the behavior is any different now.

@jmpesp (Contributor) left a comment:

Nice :)

@@ -1265,7 +1261,9 @@ impl DownstairsClient {
    ///
    /// The IO task will automatically restart in the main event handler
    pub(crate) fn disable(&mut self, up_state: &UpstairsState) {
        self.halt_io_task(up_state, ClientStopReason::Disabled);
        if self.client_task.client_connect_tx.is_none() {
A contributor commented on this diff:

This diff means that disable is not unconditional anymore - I know halt_io_task does a take as written today, but that could change, and then this block of code wouldn't be correct.

@mkeeter (Contributor, Author) replied:

I added some comments in a240339 and made the check slightly stricter, so it will only skip disabling if it would be a no-op.

@mkeeter merged commit 03858b9 into main on Jun 13, 2025 (17 checks passed).
@mkeeter deleted the mkeeter/add-disabled-state branch on June 13, 2025 at 13:38.
mkeeter added a commit that referenced this pull request Jun 16, 2025
The `WaitActive` state has outlived its usefulness.

Right now, there are two wait points in the Upstairs's client task:

- It waits for a oneshot (`client_connect_tx/rx`) to be fired before
  connecting to the Downstairs
- After receiving `YesItsMe` from the downstairs, it waits for the
  upstairs to be activated before sending `PromoteToActive`. This is
  implemented through a runtime check when it reaches the `WaitForActive`
  state

It turns out that these two things always happen together; there's no
case where we want to connect to a Downstairs, get to `YesItsMe`, then
_not_ promote it to active.

This PR removes the `WaitActive` state, moving straight from `Start` to
`WaitForPromote` once the connection oneshot fires. I believe this fixes
the (hypothetical) panic proposed in #1721, although I haven't written
out a unit test for it.

Most of the LOC changes are removing the increasingly-outdated block
comment about negotiation; it's all nicely represented in the
`NegotiationState` docstring, which has been kept up to date.
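
A hedged sketch of the simplified flow this commit describes; the Start and WaitForPromote names come from the commit message above, while the rest is illustrative.

    // Illustrative states; the real NegotiationState has more variants.
    #[derive(Debug, PartialEq)]
    enum NegotiationState {
        Start,          // waiting for the connection one-shot
        WaitForPromote, // connected; PromoteToActive has been sent
    }

    struct Client {
        state: NegotiationState,
    }

    impl Client {
        /// Before this commit (roughly): Start -> WaitActive (runtime
        /// check that the upstairs is activating) -> WaitForPromote.
        /// After: the one-shot only fires once activation is underway,
        /// so we go straight from Start to WaitForPromote.
        fn on_connect_oneshot_fired(&mut self) {
            assert_eq!(self.state, NegotiationState::Start);
            // Connect to the Downstairs, receive YesItsMe, and immediately
            // send PromoteToActive -- no separate WaitActive step.
            self.state = NegotiationState::WaitForPromote;
        }
    }

    fn main() {
        let mut c = Client { state: NegotiationState::Start };
        c.on_connect_oneshot_fired();
        println!("negotiation state: {:?}", c.state);
    }
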
jmpesp added a commit to oxidecomputer/omicron that referenced this pull request Aug 11, 2025
Pick up the following propolis PRs:

- Bump crucible rev to latest (oxidecomputer/propolis#922)
- Added block_size for file backends in propolis_server (workers is optional) (oxidecomputer/propolis#917)

Pick up the following crucible PRs:

- Snapshots existing already are ok! (oxidecomputer/crucible#1759)
- Less verbose logging (oxidecomputer/crucible#1756)
- Remove unused `Vec<JoinHandle>` (oxidecomputer/crucible#1754)
- Split "check reconciliation state" from "start reconciliation" (oxidecomputer/crucible#1732)
- Improve `ClientIoTask` start logic (oxidecomputer/crucible#1731)
- Use data-bearing enum variants pattern in negotiation (oxidecomputer/crucible#1727)
- Make Downstairs stoppable (oxidecomputer/crucible#1730)
- Don't log every region's metadata (oxidecomputer/crucible#1729)
- Compute reconciliation from `ClientMap` instead of three clients (oxidecomputer/crucible#1726)
- Make Offline -> Faulted transition happen without reconnecting (oxidecomputer/crucible#1725)
- Remove `WaitActive` state during negotiation (oxidecomputer/crucible#1722)
- Add explicit `UpstairsState::Disabled` (oxidecomputer/crucible#1721)
- Print version on startup for pantry and agent (oxidecomputer/crucible#1723)
- Update test mem to also show physical space used by regions. (oxidecomputer/crucible#1724)
- Add AllStopped command and endpoint for downstairs status (oxidecomputer/crucible#1718)
- Update tests to honor REGION_SETS env if provided. (oxidecomputer/crucible#1720)
- Fix panic in `set_active_request` when client is in Stopping(Replacing) state (oxidecomputer/crucible#1717)
- DTrace updates (oxidecomputer/crucible#1715)