Add explicit UpstairsState::Disabled #1721

Conversation
So, I believe this path was to handle what happens when we have a bad set of targets for the downstairs. At least that was the idea, and it seemed better than just panicking the upstairs. We did actually hit this scenario where a ROP had finished scrubbing but was still "attached" to the downstairs (a bug that is fixed). The actual ROP was eventually deleted, then the port numbers were re-used, and this long-running upstairs tried to reconnect (as the new downstairs came online).

The UUID mismatch is one check, but if a downstairs has different region info, that would (should) take the same path here and result in the same end state, whatever we decide that state should be. My feeling is this condition means something has gone terribly wrong somewhere, and I do think it's better to just hang and require operator intervention instead of either moving forward or panicking. Does that seem like the right idea?
Yup, that sounds reasonable to me! Do you think having an …?
I don't love the name, but I'm also fine with it. I also don't have any other suggestions. Maybe the path forward would be to replace the …
A few specific comments:
One option would be to make this a downstairs state, and not have …

This discussion has helped me clarify the vibes of what's going on here: certain failure modes mean that a downstairs should not try to reconnect, basically dropping into a fail-safe mode where it hangs out and waits for further instruction (see the sketch just below this comment).

Follow-up questions: …
I'm not sure if anyone knows the answers yet, but I'm curious to hear everyone's thoughts.
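(To make the "downstairs state" option concrete, here is a hypothetical sketch – the type and variant names are made up for illustration and are not crucible's actual client states:)

```rust
// Illustrative only: modeling the fail-safe as a per-Downstairs client
// state rather than an Upstairs-wide one. A client in Disabled stays
// stopped and does not try to reconnect until told otherwise.
#[derive(Debug)]
enum HypotheticalClientState {
    Connecting,
    Active,
    /// Terminal fail-safe: hang out and wait for further instruction
    /// (e.g. from an operator or the control plane).
    Disabled,
}
```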
So, I'll also think about this overnight, but …
Ah, okay. I think this is still okay. If one of our downstairs is wrong in a bad way, then, yeah, stopping the upstairs seems like our best choice of all the bad choices we have.
Yes, and if we do live migration, this is what would happen. In that situation we need to consider how we could allow an upstairs that got the …

I'm still thinking about the follow-up questions; I'm going to sleep on those and see what the morning brings me :)
In the case of live migration, the destination propolis would have received a new volume checkout, bumping the gen numbers. The source propolis' upstairs kinda needs to be scrapped after this - any downstairs replacement won't fix it.
Both of those are kinda terminal states.
The source propolis would need a new volume checkout and a new activation if it wants to "take back" the downstairs. You are right that a downstairs replacement would not solve anything here :)
We do have a …
My current list of things that a reconnect to the same downstairs won't fix: …

And, if things were connected and the upstairs was activated and some other upstairs takes over, the current upstairs would get kicked out, attempt to reconnect (which I think could be okay), and then be denied because a higher-generation-number upstairs has connected.
The negotiation phase should determine what happens. The only place (I believe) that a downstairs would trigger something would be if another upstairs took over. So, what I think should happen is this: …
We include James' phone number in the error message.

Seriously, I guess it depends. If we fail on startup before activation, then either the VCR (crucible opts) are bad, or we have some rogue downstairs running on the wrong port. It would most likely be a bug in the software, so recovery may not be an option. My first thought is to hang forever, but I wonder if it would be better to panic and trust that we have left behind enough logs to sift through afterwards.

If we have already activated and one of the downstairs pulls the plug, then I think the other two should keep going until either we get another downstairs that pulls the plug, or someone sends the upstairs a deactivation request. Once we have two downstairs that have opted out, the upstairs is going to hang on the next flush, though letting the final remaining downstairs get the writes and flush that are in flight is probably the right thing to do here. And, given that we could be migrating, getting all the data out of the upstairs is what we want.
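(A toy illustration of that policy – not crucible code; the function name and the threshold are assumptions based on "hang on the next flush" once only one Downstairs remains:)

```rust
// Toy model: a flush needs at least two participating Downstairs to be
// acknowledged, so after two have opted out the next flush hangs.
fn flush_can_complete(participating_downstairs: usize) -> bool {
    participating_downstairs >= 2
}

fn main() {
    assert!(flush_can_complete(3)); // all three healthy
    assert!(flush_can_complete(2)); // one opted out: keep going
    assert!(!flush_can_complete(1)); // two opted out: hang on next flush
}
```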
I've done some thinking and made a few more changes based on the discussion above. Here are the semantics I've settled on for …
The major difference between this and @leftwo's suggestions is that all three clients are stopped when we disable the upstairs (for any reason). Alan suggested continuing to run while > 1 Downstairs were active, but Downstairs running while the Upstairs is disabled makes me nervous – it seems like an edge-case generator, and we could theoretically limp along in this state indefinitely. Disabling all three Downstairs immediately puts us into a known, well-behaved state. The downside is that some IO may be dropped, but it doesn't violate any invariants. The …
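(A minimal sketch of those semantics using stand-in types – the names and structure here are assumptions for illustration, not the actual upstairs code:)

```rust
// Sketch only: disabling the upstairs records why, then stops every
// client immediately, so the volume lands in one known, well-behaved
// state instead of limping along with a partial quorum.
enum UpstairsState {
    Initializing,
    Active,
    Disabled(String), // reason we gave up
}

struct Client {
    running: bool,
}

impl Client {
    fn disable(&mut self) {
        self.running = false; // stopped; will not reconnect on its own
    }
}

struct Upstairs {
    state: UpstairsState,
    clients: Vec<Client>, // the three Downstairs clients
}

impl Upstairs {
    fn set_disabled(&mut self, reason: String) {
        self.state = UpstairsState::Disabled(reason);
        // All three clients stop, regardless of which one failed.
        for c in self.clients.iter_mut() {
            c.disable();
        }
    }
}
```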
I tried this out with a modified downstairs that would do slow IOs, just to see how it handled it, and no issues there.
While I do like the idea of keeping 2/3 downstairs going if the 3rd has left, I see how it can make the upstairs state confusing, and the current behavior already is to deactivate the upstairs when a single downstairs encounters an unrecoverable error, so it's not like the behavior is any different now.
Nice :)
upstairs/src/client.rs (Outdated)
@@ -1265,7 +1261,9 @@ impl DownstairsClient {
     ///
     /// The IO task will automatically restart in the main event handler
     pub(crate) fn disable(&mut self, up_state: &UpstairsState) {
-        self.halt_io_task(up_state, ClientStopReason::Disabled);
+        if self.client_task.client_connect_tx.is_none() {
This diff means that `disable` is not unconditional anymore - I know `halt_io_task` does a `take` as written today, but that could change, and then this block of code wouldn't be correct.
I added some comments in a240339 and made the check slightly stricter, so it will only skip disabling if it would be a no-op.
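(For concreteness, here is one shape the stricter check could take – the `stop_reason` field is an assumption for this sketch and is not the actual code in a240339:)

```rust
// Sketch: skip the halt only when it would be a true no-op, i.e. the
// task has already been told to stop for this same reason; otherwise
// disable() remains effectively unconditional.
pub(crate) fn disable(&mut self, up_state: &UpstairsState) {
    let already_disabled = self.client_task.client_connect_tx.is_none()
        && self.stop_reason == Some(ClientStopReason::Disabled);
    if already_disabled {
        return; // halting again would change nothing
    }
    self.halt_io_task(up_state, ClientStopReason::Disabled);
}
```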
The `WaitActive` state has outlived its usefulness. Right now, there are two wait points in the Upstairs's client task:

- It waits for a oneshot (`client_connect_tx/rx`) to be fired before connecting to the Downstairs
- After receiving `YesItsMe` from the downstairs, it waits for the upstairs to be activated before sending `PromoteToActive`. This is implemented through a runtime check when it reaches the `WaitForActive` state

It turns out that these two things always happen together; there's no case where we want to connect to a Downstairs, get to `YesItsMe`, then _not_ promote it to active. This PR removes the `WaitActive` state, moving straight from `Start` to `WaitForPromote` once the connection oneshot fires.

I believe this fixes the (hypothetical) panic proposed in #1721, although I haven't written out a unit test for it.

Most of the LOC changes are removing the increasingly-outdated block comment about negotiation; it's all nicely represented in the `NegotiationState` docstring, which has been kept up to date.
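(A self-contained sketch of the simplified flow – the two state names come from the commit message above; everything else is assumed:)

```rust
// With WaitActive gone, the connection oneshot firing moves the client
// straight from Start to WaitForPromote; there is no separate "wait
// for the upstairs to activate" step anymore.
#[derive(Debug, PartialEq)]
enum NegotiationState {
    Start,
    WaitForPromote,
    // ...remaining negotiation states elided
}

fn on_connect_oneshot_fired(state: &mut NegotiationState) {
    assert_eq!(*state, NegotiationState::Start);
    *state = NegotiationState::WaitForPromote;
}
```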
Bump crucible and propolis revs to latest

Pick up the following propolis PRs:
- Bump crucible rev to latest (oxidecomputer/propolis#922)
- Added block_size for file backends in propolis_server (workers is optional) (oxidecomputer/propolis#917)

Pick up the following crucible PRs:
- Snapshots existing already are ok! (oxidecomputer/crucible#1759)
- Less verbose logging (oxidecomputer/crucible#1756)
- Remove unused `Vec<JoinHandle>` (oxidecomputer/crucible#1754)
- Split "check reconciliation state" from "start reconciliation" (oxidecomputer/crucible#1732)
- Improve `ClientIoTask` start logic (oxidecomputer/crucible#1731)
- Use data-bearing enum variants pattern in negotiation (oxidecomputer/crucible#1727)
- Make Downstairs stoppable (oxidecomputer/crucible#1730)
- Don't log every region's metadata (oxidecomputer/crucible#1729)
- Compute reconciliation from `ClientMap` instead of three clients (oxidecomputer/crucible#1726)
- Make Offline -> Faulted transition happen without reconnecting (oxidecomputer/crucible#1725)
- Remove `WaitActive` state during negotiation (oxidecomputer/crucible#1722)
- Add explicit `UpstairsState::Disabled` (oxidecomputer/crucible#1721)
- Print version on startup for pantry and agent (oxidecomputer/crucible#1723)
- Update test mem to also show physical space used by regions. (oxidecomputer/crucible#1724)
- Add AllStopped command and endpoint for downstairs status (oxidecomputer/crucible#1718)
- Update tests to honor REGION_SETS env if provided. (oxidecomputer/crucible#1720)
- Fix panic in `set_active_request` when client is in Stopping(Replacing) state (oxidecomputer/crucible#1717)
- DTrace updates (oxidecomputer/crucible#1715)

Co-authored-by: Artemis Everfree <[email protected]>
There's a weird phantom state that's been hiding in the Upstairs state machine; this PR makes it explicit.
In the handler for `YouAreNoLongerActive` and `on_uuid_mismatch`, the Upstairs performs a peculiar ritual. The effects here are somewhat confusing:
- `DownstairsClient::disable` stops a client with `ClientStopReason::Disabled`. This stop reason is a special case – it means that the client does not try to reconnect when reinitialized. In all other cases, whether the client connects depends on the upstairs state alone.
- `Upstairs::set_inactive` sets the upstairs state to `Initializing`. It does not do anything else – for example, it doesn't try to stop the other Downstairs.

The end result is that the problematic client is restarted, and does not connect to the Downstairs.
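(Paraphrasing that ritual as a sketch – these are stand-in types and names, not the literal handler code:)

```rust
// Sketch of the pre-PR behavior: only the offending client is stopped,
// and the upstairs quietly drops back to Initializing while the other
// two Downstairs keep running.
enum UpstairsState {
    Initializing,
    Active,
}

struct Client {
    will_reconnect: bool,
}

fn on_you_are_no_longer_active(state: &mut UpstairsState, bad: &mut Client) {
    // The ClientStopReason::Disabled special case: the client task
    // restarts, but is told not to reconnect when reinitialized.
    bad.will_reconnect = false;
    // set_inactive(): reset the upstairs state and nothing else --
    // in particular, the other two Downstairs are left running.
    *state = UpstairsState::Initializing;
}
```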
As we rethink the state machine for RFD 542, I'd like to remove this special case. In this PR:

- `Upstairs::set_inactive` is renamed to `Upstairs::set_disabled` and now sets the upstairs state to a new `UpstairsState::Disabled` state
- `auto_promote: bool` is removed from the negotiation state, because we now only depend on the upstairs state

I still think the semantics of the `UpstairsState::Disabled` are fuzzy and could use some ironing out, but this PR is meant to be a step in the right direction.

For example, going straight to `Initializing` without shutting down the other Downstairs seems bad! Once the upstairs is in `Initializing`, it will accept a `GoActive` request, which will hit this panic on the other Downstairs. This issue remains true after the PR (although the Upstairs will be in `Disabled` instead).