
Conversation

@mkeeter (Contributor) commented May 22, 2025

There's a weird phantom state that's been hiding in the Upstairs state machine; this PR makes it explicit.

In the handler for YouAreNoLongerActive and on_uuid_mismatch, the Upstairs performs a peculiar ritual:

        // Restart the state machine for this downstairs client
        self.downstairs.clients[client_id].disable(&self.state);
        self.set_inactive(CrucibleError::UuidMismatch);

The effects here are somewhat confusing:

  • Calling DownstairsClient::disable stops a client with ClientStopReason::Disabled. This stop reason is a special case – it means that the client does not try to reconnect when reinitialized. In all other cases, whether the client connects depends on the upstairs state alone.
  • Upstairs::set_inactive sets the upstairs state to Initializing. It does not do anything else – for example, it doesn't try to stop the other Downstairs.

The end result is that the problematic client is restarted, and does not connect to the Downstairs.

As we rethink the state machine for RFD 542, I'd like to remove this special case. In this PR:

  • Upstairs::set_inactive is renamed to Upstairs::set_disabled and now sets the upstairs state to a new UpstairsState::Disabled state
  • auto_promote: bool is removed from the negotiation state, because we now only depend on the upstairs state

I still think the semantics of the UpstairsState::Disabled are fuzzy and could use some ironing out, but this PR is meant to be a step in the right direction.

For example, going straight to Initializing without shutting down the other Downstairs seems bad! Once the upstairs is in Initializing, it will accept a GoActive request, which will hit this panic on the other Downstairs. This issue remains true after the PR (although the Upstairs will be in Disabled instead).
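
For concreteness, here's a minimal, hedged sketch of the shape of the change. The types below are simplified stand-ins; only the UpstairsState::Disabled variant and the rename from set_inactive to set_disabled come from this PR, everything else is illustrative.

    // Simplified stand-in types, not the real crucible code.
    #[derive(Debug)]
    enum UpstairsState {
        Initializing,
        Active,
        /// New in this PR: an unrecoverable condition (e.g. a UUID
        /// mismatch) was hit, and clients will not auto-reconnect.
        Disabled,
    }

    struct Upstairs {
        state: UpstairsState,
    }

    impl Upstairs {
        /// Formerly `set_inactive`, which reset the state to
        /// `Initializing`; now the condition is recorded explicitly.
        fn set_disabled(&mut self) {
            self.state = UpstairsState::Disabled;
        }
    }

    fn main() {
        let mut up = Upstairs { state: UpstairsState::Active };
        up.set_disabled();
        println!("upstairs is now {:?}", up.state);
    }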

@leftwo (Contributor) commented May 22, 2025

The end result is that the problematic client is restarted, and does not connect to the Downstairs.

So, I believe this path was to handle what happens when we have a bad set of targets for the downstairs.
We don't believe the downstairs we have connected to is correct, so we don't want to keep trying to connect to it, but we also want to keep the upstairs running to allow for a downstairs replacement to come in and "fix" it.

At least that was the idea, and it seemed better than just panicking the upstairs.

We did actually hit this scenario, where a ROP (read-only parent) had finished scrubbing but was still "attached" to the downstairs (a bug that has since been fixed). The actual ROP was eventually deleted, then the port numbers were re-used, and this long-running upstairs tried to reconnect (as the new downstairs came online).

The UUID mismatch is one check, but if a downstairs has different region info, that would (should) take the same path here and result in the same end state, whatever we decide that state should be.

My feeling is this condition means something has gone terribly wrong somewhere, and I do think it's better to just hang and require operator intervention instead of either moving forward or panicking. Does that seem like the right idea?

@mkeeter (Contributor, Author) commented May 22, 2025

My feeling is this condition means something has gone terribly wrong somewhere, and I do think it's better to just hang and require operator intervention instead of either moving forward or panicking. Does that seem like the right idea?

Yup, that sounds reasonable to me!

Do you think having an UpstairsState::Disabled to represent this "terribly wrong" state makes sense? If so, that suggests it should not respond to GoActive requests. I'm not sure how we want to recover; we could add a new API that the agent could hit? We probably don't want to require a full restart, since the upstairs is attached to the Propolis VM.

@leftwo (Contributor) commented May 22, 2025

Do you think having an UpstairsState::Disabled to represent this "terribly wrong" state makes sense? If so, that suggests it should not respond to GoActive requests. I'm not sure how we want to recover; we could add a new API that the agent could hit? We probably don't want to require a full restart, since the upstairs is attached to the Propolis VM.

I don't love the name, but I'm also fine with it. I also don't have any other suggestions.
I do think that, if you find the upstairs in this state, then yeah it should not respond to GoActive requests.

Maybe the path forward would be to replace the Disabled downstairs from the control plane; we could do that without a VM restart. But, really, if we are in this state things are bad, and by blocking GoActive requests, we have essentially prevented the VM from booting (which is probably good, as we don't trust our downstairs targets).

@mkeeter (Contributor, Author) commented May 22, 2025

A few specific comments:

  • Disabled is not a Downstairs state; it's a newly-added variant in the UpstairsState enum, so it applies to the system as a whole (not to a specific downstairs). The problematic Downstairs will be hanging out in NegotiationState::Start, waiting for the connection one-shot to fire (which will never happen).
  • This could technically happen after boot – once we connect to a Downstairs, getting kicked out with YouAreNoLongerActive could happen at any time, which causes us to enter this state

One option would be to make this a downstairs state, and not have UpstairsState::Disabled.

This discussion has helped me clarify the vibes of what's going on here: certain failure modes mean that a downstairs should not try to reconnect, basically dropping into a fail-safe mode where it hangs out and waits for further instruction.
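
To make "hangs out and waits for further instruction" concrete, here's a minimal sketch of the connection one-shot gate from the first bullet above, assuming a tokio-style oneshot; the task shape and names are illustrative, not the real crucible client task.

    // Sketch only: requires the tokio crate ("sync", "rt-multi-thread",
    // and "macros" features).
    use tokio::sync::oneshot;

    // Stand-in for the client IO task: it may not connect until the
    // one-shot fires. If the sender is never fired (e.g. the client was
    // disabled), it never connects and just sits idle.
    async fn client_io_task(connect_rx: oneshot::Receiver<()>) {
        match connect_rx.await {
            Ok(()) => println!("permission granted; connecting to downstairs"),
            Err(_) => println!("never told to connect; staying idle"),
        }
    }

    #[tokio::main]
    async fn main() {
        let (connect_tx, connect_rx) = oneshot::channel();
        let task = tokio::spawn(client_io_task(connect_rx));

        // Dropping the sender without firing it models the fail-safe mode:
        // the disabled client is never granted permission to connect.
        drop(connect_tx);
        task.await.unwrap();
    }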

Follow-up questions:

  • What failures should be classified as "drop into fail-safe mode and stop trying to connect"?
  • What should happen to the upstairs if a downstairs triggers fail-safe mode?
    • What should it do with the other downstairs?
  • How should we recover from fail-safe mode?

I'm not sure if anyone knows the answers yet, but curious to hear everyone's thoughts.

@leftwo (Contributor) commented May 23, 2025

So, I'll also think about this overnight, but

Disabled is not a Downstairs state ..

Ah, okay. I think this is still okay. If one of our downstairs is wrong in a bad way, then, yeah, stopping the upstairs seems like our best choice of all the bad choices we have.

This could technically happen after boot ..

Yes, and if we do live migration, this is what would happen. In that situation we need to consider how we could allow an upstairs that got the YouAreNoLongerActive because of a migration, when that migration then failed and we are trying to put the original pieces back together. Is there a path by which we could do that?

I'm still thinking about the follow-up questions; I'm going to sleep on those and see what the morning brings me :)

@jmpesp (Contributor) commented May 23, 2025

In that situation we need to consider how we could allow an upstairs that got the YouAreNoLongerActive because of a migration, when that migration then failed and we are trying to put the original pieces back together. Is there a path by which we could do that?

In the case of live migration, the destination propolis would have received a new volume checkout, bumping the gen numbers. The source propolis' upstairs kinda needs to be scrapped after this - any downstairs replacement won't fix it.

@jmpesp (Contributor) commented May 23, 2025

Disabled may not carry enough information. We may want to separate out the different failure modes here:

  • CannotActivate when we see mismatched downstairs region info for example
  • KickedOut when we see YouAreNoLongerActive from any of the downstairs.

Both of those are kinda terminal states. Disabled is a bit vague, and I'm not sure we support the idea of Upstairs hanging around doing nothing, apart from before the GoActive is received?

@leftwo (Contributor) commented May 23, 2025

In that situation we need to consider how we could allow an upstairs that got the YouAreNoLongerActive because of a migration, when that migration then failed and we are trying to put the original pieces back together. Is there a path by which we could do that?

In the case of live migration, the destination propolis would have received a new volume checkout, bumping the gen numbers. The source propolis' upstairs kinda needs to be scrapped after this - any downstairs replacement won't fix it.

The source propolis would need a new volume checkout and a new activation if it wants to "take back" the downstairs. You are right that a downstairs replacement would not solve anything here :)

@leftwo (Contributor) commented May 23, 2025

Disabled may not carry enough information. We may want to separate out the different failure modes here:

  • CannotActivate when we see mismatched downstairs region info for example
  • KickedOut when we see YouAreNoLongerActive from any of the downstairs.

Both of those are kinda terminal states. Disabled is a bit vague, and I'm not sure we support the idea of Upstairs hanging around doing nothing, apart from before the GoActive is received?

We do have a BlockOp::Deactivate that will result in the Upstairs disconnecting from the downstairs and then just sitting there. However, we only use that in tests; Propolis never calls it. In theory the upstairs could self-deactivate and then sit there waiting to either be taken down or to get a new activation. This is also tricky, as a running upstairs has targets and will expect them to stay the same; the re-activation can only provide an updated generation number and not change anything else.

@leftwo (Contributor) commented May 23, 2025

Follow-up questions:

  • What failures should be classified as "drop into fail-safe mode and stop trying to connect"?

My current list of things that a reconnect to the same downstairs won't fix:

  • Generation number too low
  • Expected downstairs UUID mismatch
  • Any RegionInfo mismatch (block size, extents, etc.)
  • Failure to complete reconciliation (never actually seen this, not sure how it would even happen)
  • Incompatible version (not yet, but maybe someday :) )
  • Encryption mismatch (not expected to ever happen in production)
  • Read-only mismatch

And, if things were connected and the upstairs was activated and some other upstairs took over, the current upstairs would get kicked out, then attempt to reconnect (which I think could be okay), and then be denied because a higher generation number upstairs has connected.
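
As a hedged illustration of that classification, here's a small sketch; the failure reasons come from the list above, but the enum and helper are illustrative and not crucible's actual ClientStopReason handling.

    // Illustrative only: reasons a connection attempt or negotiation
    // can fail, split into "retry" vs. "drop into fail-safe mode".
    #[derive(Debug)]
    enum NegotiationFailure {
        GenerationTooLow,
        UuidMismatch,
        RegionInfoMismatch, // block size, extent count, etc.
        ReconciliationFailed,
        IncompatibleVersion,
        EncryptionMismatch,
        ReadOnlyMismatch,
        ConnectionTimeout,
    }

    /// Returns true if reconnecting to the same downstairs cannot help,
    /// i.e. the upstairs should drop into fail-safe mode instead of retrying.
    fn is_unrecoverable(failure: &NegotiationFailure) -> bool {
        match failure {
            NegotiationFailure::GenerationTooLow
            | NegotiationFailure::UuidMismatch
            | NegotiationFailure::RegionInfoMismatch
            | NegotiationFailure::ReconciliationFailed
            | NegotiationFailure::IncompatibleVersion
            | NegotiationFailure::EncryptionMismatch
            | NegotiationFailure::ReadOnlyMismatch => true,
            // A plain connection failure is worth retrying.
            NegotiationFailure::ConnectionTimeout => false,
        }
    }

    fn main() {
        let f = NegotiationFailure::UuidMismatch;
        println!("{:?} is unrecoverable: {}", f, is_unrecoverable(&f));
    }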

  • What should happen to the upstairs if a downstairs triggers fail-safe mode?

    • What should it do with the other downstairs?

The negotiation phase should determine what happens. The only place (I believe) that a downstairs would trigger something would be if another upstairs took over.

So, what I think should happen is this.

  • If the upstairs is not yet active, then it gives up and does not activate. It can return an error if an activation request has been sent, and then it just sits there (at least for now).
  • How should we recover from fail-safe mode?

We include James' phone number in the error message.

Seriously, I guess it depends. If we fail on startup before activation, then either the VCR (crucible opts) are bad, or we have some rogue downstairs running on the wrong port. It would most likely be a bug in the software, so recovery may not be an option. My first thought is to hang forever, but I wonder if it would be better to panic and trust that we have left behind enough logs to sift through afterwards.

If we have already activated, and one of the downstairs pulls the plug, then I think the other two should keep going until either we get another downstairs that pulls the plug, or someone sends the upstairs a deactivation request. Once we have two downstairs that have opted out, the upstairs is going to hang on the next flush, though letting the final remaining downstairs get the writes and the flush that are in flight is probably the right thing to do here. And, given that we could be migrating, getting all the data out of the upstairs is what we want.

@mkeeter force-pushed the mkeeter/add-disabled-state branch from d4d8223 to e76f248 on June 10, 2025 at 14:31.
@mkeeter (Contributor, Author) commented Jun 10, 2025

I've done some thinking and made a few more changes based on the discussion above.

Here are the semantics I've settled on for UpstairsState::Disabled:

  • It can be entered through the following paths:
    • Client negotiation failed (IncompatibleSession or IncompatibleSettings)
    • Collation failed during negotiation
    • Getting NoLongerActive from a Downstairs
    • Getting UuidMismatch from a Downstairs
  • When we enter this state
    • All three Downstairs clients are marked as disabled
      • Their client IO task is stopped
      • When it automatically restarts, it will not try to connect
    • Any pending activation request is replied with an error
  • The Upstairs can be reinitialized by sending a fresh activation request
  • The Upstairs can't be deactivated from this state, because it's not active

The major difference between this and @leftwo's suggestions is that all three clients are stopped when we disable the upstairs (for any reason). Alan suggested continuing to run while > 1 Downstairs were active, but Downstairs running while the Upstairs is disabled makes me nervous – it seems like an edge case generator, and we could theoretically limp along in this state indefinitely.

Disabling all three Downstairs immediately puts us into a known, well-behaved state. The downside is that some IO may be dropped, but it doesn't violate any invariants.

The UpstairsState::Disabled variant now also includes a CrucibleError, so we can see why we're in that state.
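
A hedged sketch of those semantics, using simplified stand-in types; only the overall behavior described above (disable every client, fail a pending activation, record the error in the state) is modeled here, and the type shapes are stand-ins even where the names match the thread.

    // Simplified stand-ins so the example is self-contained.
    #[derive(Debug)]
    struct CrucibleError(String);

    #[derive(Debug)]
    enum UpstairsState {
        Initializing,
        GoActive, // an activation request is pending
        Active,
        Disabled(CrucibleError),
    }

    struct DownstairsClient {
        enabled: bool,
    }

    struct Upstairs {
        state: UpstairsState,
        clients: [DownstairsClient; 3],
    }

    impl Upstairs {
        /// Disable the whole upstairs: mark all three clients as disabled
        /// (the real code stops each client IO task, which restarts but
        /// will not reconnect) and fail any pending activation request.
        fn set_disabled(&mut self, err: CrucibleError) {
            for c in self.clients.iter_mut() {
                c.enabled = false;
            }
            if matches!(self.state, UpstairsState::GoActive) {
                // The real code replies to the waiting activation request
                // with an error here.
                println!("activation failed: {:?}", err);
            }
            self.state = UpstairsState::Disabled(err);
        }
    }

    fn main() {
        let mut up = Upstairs {
            state: UpstairsState::GoActive,
            clients: [
                DownstairsClient { enabled: true },
                DownstairsClient { enabled: true },
                DownstairsClient { enabled: true },
            ],
        };
        up.set_disabled(CrucibleError("UUID mismatch".to_string()));
        println!("{:?}", up.state);
    }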

@leftwo (Contributor) left a comment:

I tried this out with a modified downstairs that would do slow IOs, just to see how it handled it, and no issues there.

While I do like the idea of keeping 2/3 downstairs going if the 3rd has left, I see how it can make the upstairs state confusing, and the current behavior is already to deactivate the upstairs when a single downstairs encounters an unrecoverable error, so it's not like the behavior is any different now.

@jmpesp (Contributor) left a comment:

Nice :)

@@ -1265,7 +1261,9 @@ impl DownstairsClient {
    ///
    /// The IO task will automatically restart in the main event handler
    pub(crate) fn disable(&mut self, up_state: &UpstairsState) {
        self.halt_io_task(up_state, ClientStopReason::Disabled);
        if self.client_task.client_connect_tx.is_none() {
A contributor commented on this diff:

This diff means that disable is not unconditional anymore - I know halt_io_task does a take as written today, but that could change, and then this block of code wouldn't be correct.

@mkeeter (Contributor, Author) replied:

I added some comments in a240339 and made the check slightly stricter, so it will only skip disabling if it would be a no-op.

@mkeeter merged commit 03858b9 into main on Jun 13, 2025 (17 checks passed).
@mkeeter deleted the mkeeter/add-disabled-state branch on June 13, 2025 at 13:38.
mkeeter added a commit that referenced this pull request Jun 16, 2025
The `WaitActive` state has outlived its usefulness.

Right now, there are two wait points in the Upstairs's client task:

- It waits for a oneshot (`client_connect_tx/rx`) to be fired before
  connecting to the Downstairs
- After receiving `YesItsMe` from the downstairs, it waits for the
  upstairs to be activated before sending `PromoteToActive`. This is
  implemented through a runtime check when it reaches the `WaitForActive`
  state

It turns out that these two things always happen together; there's no
case where we want to connect to a Downstairs, get to `YesItsMe`, then
_not_ promote it to active.

This PR removes the `WaitActive` state, moving straight from `Start` to
`WaitForPromote` once the connection oneshot fires. I believe this fixes
the (hypothetical) panic proposed in #1721, although I haven't written
out a unit test for it.

Most of the LOC changes are removing the increasingly-outdated block
comment about negotiation; it's all nicely represented in the
`NegotiationState` docstring, which has been kept up to date.
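
A hedged sketch of the simplified flow this commit describes; the Start and WaitForPromote names come from the commit message above, while the rest is illustrative.

    // Illustrative states; the real NegotiationState has more variants.
    #[derive(Debug, PartialEq)]
    enum NegotiationState {
        Start,          // waiting for the connection one-shot
        WaitForPromote, // connected; PromoteToActive has been sent
    }

    struct Client {
        state: NegotiationState,
    }

    impl Client {
        /// Before this commit (roughly): Start -> WaitActive (runtime
        /// check that the upstairs is activating) -> WaitForPromote.
        /// After: the one-shot only fires once activation is underway,
        /// so we go straight from Start to WaitForPromote.
        fn on_connect_oneshot_fired(&mut self) {
            assert_eq!(self.state, NegotiationState::Start);
            // Connect to the Downstairs, receive YesItsMe, and immediately
            // send PromoteToActive -- no separate WaitActive step.
            self.state = NegotiationState::WaitForPromote;
        }
    }

    fn main() {
        let mut c = Client { state: NegotiationState::Start };
        c.on_connect_oneshot_fired();
        println!("negotiation state: {:?}", c.state);
    }
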
jmpesp added a commit to oxidecomputer/omicron that referenced this pull request Aug 11, 2025
Pick up the following propolis PRs:

- Bump crucible rev to latest (oxidecomputer/propolis#922)
- Added block_size for file backends in propolis_server (workers is optional) (oxidecomputer/propolis#917)

Pick up the following crucible PRs:

- Snapshots existing already are ok! (oxidecomputer/crucible#1759)
- Less verbose logging (oxidecomputer/crucible#1756)
- Remove unused `Vec<JoinHandle>` (oxidecomputer/crucible#1754)
- Split "check reconciliation state" from "start reconciliation" (oxidecomputer/crucible#1732)
- Improve `ClientIoTask` start logic (oxidecomputer/crucible#1731)
- Use data-bearing enum variants pattern in negotiation (oxidecomputer/crucible#1727)
- Make Downstairs stoppable (oxidecomputer/crucible#1730)
- Don't log every region's metadata (oxidecomputer/crucible#1729)
- Compute reconciliation from `ClientMap` instead of three clients (oxidecomputer/crucible#1726)
- Make Offline -> Faulted transition happen without reconnecting (oxidecomputer/crucible#1725)
- Remove `WaitActive` state during negotiation (oxidecomputer/crucible#1722)
- Add explicit `UpstairsState::Disabled` (oxidecomputer/crucible#1721)
- Print version on startup for pantry and agent (oxidecomputer/crucible#1723)
- Update test mem to also show physical space used by regions. (oxidecomputer/crucible#1724)
- Add AllStopped command and endpoint for downstairs status (oxidecomputer/crucible#1718)
- Update tests to honor REGION_SETS env if provided. (oxidecomputer/crucible#1720)
- Fix panic in `set_active_request` when client is in Stopping(Replacing) state (oxidecomputer/crucible#1717)
- DTrace updates (oxidecomputer/crucible#1715)