stabilize worker_total_busy_duration#6899
Conversation
…ch and Histogram out of unstable flag
| /// Whether the histogram used to aggregate a metric uses a linear or | ||
| /// logarithmic scale. | ||
| #[derive(Debug, Copy, Clone, Eq, PartialEq)] | ||
| #[non_exhaustive] | ||
| #[allow(unreachable_pub)] | ||
| pub enum HistogramScale { | ||
| /// Linear bucket scale | ||
| Linear, | ||
|
|
||
| /// Logarithmic bucket scale | ||
| #[allow(dead_code)] |
There was a problem hiding this comment.
is this change required for the mean time stabilization or just a coincidental change?
There was a problem hiding this comment.
I believe this change is required.
At master we have implemented another struct WorkerMetrics at tokio/src/runtime/metrics/mock.rs which didn't use struct Histogram defined at tokio/src/runtime/metrics/histogram.rs.
In order to stabilize the API, I have to removed this mock WorkerMetrics and instead point to the "real" WorkerMetrics at tokio/src/runtime/metrics/worker.rs
The "real" WorkerMetrics has a field poll_count_histogram which is Option<Histogram>, and thus will attempt to parse tokio/src/runtime/metrics/histogram.rs. From there, you can see Histogram and HistogramBuilder both refers to HistogramScale which is hidden inside tokio_unstable. I think this wouldn't compile.
(I might be wrong though so feel free to correct me)
There was a problem hiding this comment.
I need to read through this PR in more detail so bare with me, but we shouldn't be making this stable and public if it isn't actually needed to to stabilize worker_total_busy_duration (and I'm mostly sure that it isn't).
In general, all the #[allow(dead_code)] that are required for this chnage give me the impression that we're exposing something that is only really used in tokio_unstable and so we should find a way to expose it only unstable tokio_unstable.
There was a problem hiding this comment.
Thanks for your comment!
I think the core issue here is that by exposing worker_total_busy_duration, we are also exposing the "real" WorkerMetrics at tokio/src/runtime/metrics/worker.rs, which in turn will attempt to parse HistogramScale.
Currently at master branch, when no tokio_unstable flag is passed in, the WorkerMetrics at tokio/src/runtime/metrics/mock.rs will be parsed (which is just an empty struct). And if the flag is passed in, the WorkerMetrics at tokio/src/runtime/metrics/worker.rs will be parsed
There was a problem hiding this comment.
it's not actually public, it's still behind config_unstable. (Just verified via the autogenerated docs)
There was a problem hiding this comment.
Yeah becoz HistogramScale is still behind cfg_unstable_metrics:
There was a problem hiding this comment.
OK, now I believe I understand.
I would suggest that instead of compiling with the entire worker metrics in order to access only the busy_duration_total field, we should gate all the fields that we won't be using on cfg_unstable_metrics. Otherwise users will still be paying the price for metrics which they don't have access to - and that is something that we would really like to avoid.
Have a look at the runtime Builder for an example of how we have fields and implementations that touch them gated on a cfg flag:
tokio/tokio/src/runtime/builder.rs
Lines 125 to 126 in 512e9de
As a general rule, if you need #[allow(dead_code)] then there is probably something that should be gated on a cfg flag instead.
There was a problem hiding this comment.
Thanks @hds . I've reverted changes to histogram.rs. Also the real Histogram, HistogramBuilder and HistogramBatch are gated behind unstable flag now, and instead I've created a mocked version of these for compilation.
For WorkerMetrics, as we target to stabilize more metrics, I'd suggest exposing all fields instead of stabilizing only busy_duration_total and putting other fields behind unstable.
Let me know what you think
There was a problem hiding this comment.
@Owen-CH-Leung the problem with stabilizing all of WorkerMetrics is that when tokio_unstable is not enabled, the runtime will pay the price for all those metrics, but there will be no way to access them. For this reason I think that it would be better to only stabilize what we're exposing.
There was a problem hiding this comment.
Thanks @hds . I've made changes to hide most of the fields of WorkerMetrics behind unstable flag, except for busy_duration_total, queue_depth and thread_id. The latter 2 fields are needed in order for set_queue_depth and set_thread_id to work properly.
I've also enriched the mock MetricsBatch to have minimal implementation of batch::MetricsBatch so that the worker_total_busy_duration API can function properly under stable build
Let me know your thoughts!
hds
left a comment
There was a problem hiding this comment.
Please bare with me as I haven't finished the review, but my first impression is that this PR is making more things public than is strictly necessary for the stabilization of worker_total_busy_duration and we should try to avoid that.
This is especially true since #6897 is also touching the histogram (although not taking it out of tokio_unstable) and that would be a breaking change if this PR is released first.
| /// Whether the histogram used to aggregate a metric uses a linear or | ||
| /// logarithmic scale. | ||
| #[derive(Debug, Copy, Clone, Eq, PartialEq)] | ||
| #[non_exhaustive] | ||
| #[allow(unreachable_pub)] | ||
| pub enum HistogramScale { | ||
| /// Linear bucket scale | ||
| Linear, | ||
|
|
||
| /// Logarithmic bucket scale | ||
| #[allow(dead_code)] |
There was a problem hiding this comment.
I need to read through this PR in more detail so bare with me, but we shouldn't be making this stable and public if it isn't actually needed to to stabilize worker_total_busy_duration (and I'm mostly sure that it isn't).
In general, all the #[allow(dead_code)] that are required for this chnage give me the impression that we're exposing something that is only really used in tokio_unstable and so we should find a way to expose it only unstable tokio_unstable.
…nges on histogram.rs and guard it behind unstable flag
…on_total, queue_depth and thread_id
| } | ||
|
|
||
| #[allow(dead_code)] | ||
| // #[allow(dead_code)] |
There was a problem hiding this comment.
| // #[allow(dead_code)] |
| } | ||
|
|
||
| #[allow(dead_code)] | ||
| // #[allow(dead_code)] |
There was a problem hiding this comment.
| // #[allow(dead_code)] |
tokio/src/runtime/metrics/worker.rs
Outdated
| .as_ref() | ||
| .map(|histogram_builder| histogram_builder.build()); | ||
| worker_metrics | ||
| cfg_unstable_metrics! { |
There was a problem hiding this comment.
It would be better if we grouped all the unstable functions together at the bottom of the impl block, instead of spreading them out.
tokio/src/runtime/metrics/worker.rs
Outdated
| #[repr(align(128))] | ||
| pub(crate) struct WorkerMetrics { | ||
| #[cfg(tokio_unstable)] | ||
| #[cfg_attr(docsrs, doc(cfg(tokio_unstable)))] |
There was a problem hiding this comment.
Is this necessary? Since this isn't a public method, it won't appear in the documentation.
tokio/src/runtime/metrics/mock.rs
Outdated
| } | ||
|
|
||
| pub(crate) fn submit(&mut self, _to: &WorkerMetrics, _mean_poll_time: u64) {} | ||
| pub(crate) fn submit(&mut self, worker: &WorkerMetrics, _mean_poll_time: u64) { |
There was a problem hiding this comment.
Do I understand correctly that this function duplicates part of the submit function in batch::MetricsBatch?
I think this is a problematic way of gradually stabilizing metrics, as it opens the possibility of having divirging implementations if a change is made to the "real" MetricsBatch by someone who doesn't realise that there is another one.
This is additionally confusing because this effectively becomes the "stable" implementation, but it lives in a module called mock.
I would propose that we instead split the metrics::MetricsBatch implementation into stable (always compiles) and unstable (gated by cfg option), the same way we've done elsewhere in this PR. The same as with another comment, we would group all the unstable functions into a single cfg_unstable_metrics! block.
There was a problem hiding this comment.
Indeed spliting metrics::MetricsBatch is a much viable way of stabilising. I've adopted your suggestion and split it into stable & unstable (and group unstable functions into a single unstable block. Thanks a lot for reviewing!
hds
left a comment
There was a problem hiding this comment.
The implementation is looking good now. Just a few organizational things so that we do conditional compilation the same as the rest of the code base.
tokio/src/runtime/metrics/batch.rs
Outdated
| pub(crate) fn unparked(&mut self) { | ||
| self.park_unpark_count += 1; | ||
| #[cfg(tokio_unstable)] | ||
| { | ||
| self.park_unpark_count += 1; | ||
| } | ||
| } |
There was a problem hiding this comment.
In the rest of the Tokio code base, we do this a different way, so let's stick to that convension. Instead of gating functionality within a function, we have a separate empty function definition when the cfg flag isn't enabled. So this function would become:
cfg_unstable_metrics! {
/// The worker was unparked.
pub(crate) fn unparked(&mut self) {
self.park_unpark_count += 1;
}
}
cfg_not_unstable_metrics! {
/// The worker was unparked.
pub(crate) fn unparked(&mut self) {}
}Please do the same here. Keep a single cfg_unstable_metrics block (and a single cfg_not_unstable_metrics block) for all the functions that require this behavior, so that they're grouped together.
For the more complex functions above that have a mix of stablized and unstablized implementation, split the unstablized part out into a separate function with an impl in each of the macro blocks (see example in the comment on submit).
tokio/src/runtime/metrics/batch.rs
Outdated
| .store(self.steal_operations, Relaxed); | ||
| worker.poll_count.store(self.poll_count, Relaxed); | ||
|
|
||
| pub(crate) fn submit(&mut self, worker: &WorkerMetrics, _mean_poll_time: u64) { |
There was a problem hiding this comment.
As per the comment on unparked, let's follow the convension elsewhere in the Tokio codebase where we need different implementations for stablized vs. unstablized functionality. Here that would mean the following:
pub(crate) fn submit(&mut self, worker: &WorkerMetrics, mean_poll_time: u64) {
worker
.busy_duration_total
.store(self.busy_duration_total, Relaxed);
self.submit_unstable(worker, mean_poll_time);
}
cfg_not_unstable_metrics! {
#[inline(always)]
fn submit_unstable(&mut self, _worker: &WorkerMetrics, _mean_poll_time: u64) {}
}
cfg_unstable_metrics! {
#[inline(always)]
fn submit_unstable(&mut self, worker: &WorkerMetrics, mean_poll_time: u64){
worker.mean_poll_time.store(_mean_poll_time, Relaxed);
worker.park_count.store(self.park_count, Relaxed);
worker
.park_unpark_count
.store(self.park_unpark_count, Relaxed);
worker.noop_count.store(self.noop_count, Relaxed);
worker.steal_count.store(self.steal_count, Relaxed);
worker
.steal_operations
.store(self.steal_operations, Relaxed);
worker.poll_count.store(self.poll_count, Relaxed);
worker
.local_schedule_count
.store(self.local_schedule_count, Relaxed);
worker.overflow_count.store(self.overflow_count, Relaxed);
if let Some(poll_timer) = &self.poll_timer {
let dst = worker.poll_count_histogram.as_ref().unwrap();
poll_timer.poll_counts.submit(dst);
}
}
}Use the same cfg_unstable_metrics and cfg_not_unstable_metrics blocks as for the other functions that need them (so there is only one of each in impl MetricsBatch.
tokio/src/runtime/metrics/batch.rs
Outdated
|
|
||
| impl MetricsBatch { | ||
| pub(crate) fn new(worker_metrics: &WorkerMetrics) -> MetricsBatch { | ||
| pub(crate) fn new(_worker_metrics: &WorkerMetrics) -> MetricsBatch { |
There was a problem hiding this comment.
Split this into 2 implementations each gated by cfg_(not_)unstable_metrics!.
tokio/src/runtime/metrics/mod.rs
Outdated
| mod worker; | ||
| pub(crate) use worker::WorkerMetrics; |
There was a problem hiding this comment.
Please move the modules and imports that are in both the macro blocks out above them, we don't need to gate them at all.
tokio/src/runtime/metrics/worker.rs
Outdated
| .as_ref() | ||
| .map(|histogram_builder| histogram_builder.build()); | ||
| worker_metrics | ||
| cfg_not_unstable_metrics! { |
There was a problem hiding this comment.
Let's move this down to be right above the cfg_unstable_metrics! block so that we keep the conditionally compiled implementations together.
|
@hds Thanks for your detailed review! I've revised the PR to gate code based on |
rcoh
left a comment
There was a problem hiding this comment.
Overall this looks good to me. I think incremental stabilization is inherently annoying to do—I think it might be better if we invested in some primitives. I think we should also endeavor to standardize the way functions are made to be no-ops.
| local_schedule_count: u64, | ||
|
|
||
| #[cfg(tokio_unstable)] | ||
| /// Number of tasks moved to the global queue to make space in the local |
There was a problem hiding this comment.
consider reorganizing to put the stable fields on the top to make it clearer
tokio/src/runtime/metrics/batch.rs
Outdated
| cfg_not_unstable_metrics! { | ||
| /// Start polling an individual task | ||
| pub(crate) fn start_poll(&mut self) {} | ||
| } | ||
|
|
||
| if let Some(poll_timer) = &mut self.poll_timer { | ||
| poll_timer.poll_started_at = Instant::now(); | ||
| cfg_unstable_metrics! { | ||
| /// Stop polling an individual task | ||
| pub(crate) fn end_poll(&mut self) { | ||
| #[cfg(tokio_unstable)] | ||
| if let Some(poll_timer) = &mut self.poll_timer { | ||
| let elapsed = duration_as_u64(poll_timer.poll_started_at.elapsed()); | ||
| poll_timer.poll_counts.measure(elapsed, 1); | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
reading through this, I wonder if we should make a macro specifically for this pattern, something like:
cfg_metrics! {
stable: {
...
},
unstable: { ... }
}
There was a problem hiding this comment.
if we are going to have a lot of a/b code when metrics or stable or not, could be helpful to avoid bugs
tokio/src/runtime/metrics/batch.rs
Outdated
| #[cfg(tokio_unstable)] { | ||
| self.steal_count += _by as u64; | ||
| } |
There was a problem hiding this comment.
I think we should probably pick a pattern—either cfg-ing the body of the function or the function entirely instead of using both
There was a problem hiding this comment.
Thanks @rcoh . I've added a new macro and put all the ab code implementation into this macro. I also recorganize fields so that stable fields come first. Please can I have your review again ? Thanks!
tokio/src/runtime/metrics/batch.rs
Outdated
| pub(crate) fn incr_steal_count(&mut self, _by: u16) { | ||
| self.steal_count += _by as u64; | ||
| } |
There was a problem hiding this comment.
| pub(crate) fn incr_steal_count(&mut self, _by: u16) { | |
| self.steal_count += _by as u64; | |
| } | |
| pub(crate) fn incr_steal_count(&mut self, by: u16) { | |
| self.steal_count += by as u64; | |
| } |
There was a problem hiding this comment.
Thanks & revised
| @@ -15,40 +18,50 @@ use std::thread::ThreadId; | |||
| #[derive(Debug, Default)] | |||
| #[repr(align(128))] | |||
| pub(crate) struct WorkerMetrics { | |||
There was a problem hiding this comment.
same—please move stable fields to top
There was a problem hiding this comment.
Thanks. Moved stable fields to top
tokio/src/runtime/metrics/worker.rs
Outdated
| /// Number of tasks currently in the local queue. Used only by the | ||
| /// current-thread scheduler. | ||
| pub(crate) queue_depth: MetricAtomicUsize, |
There was a problem hiding this comment.
is this intentionally stabilized?
There was a problem hiding this comment.
I believe queue_depth is already stabilised at master and is used by the current thread scheduler :
rcoh
left a comment
There was a problem hiding this comment.
This LGTM (other than three small notes inline).
I am fine with it as is—one option that other reviewers may consider is using the metrics variant macro you defined on the function bodies instead—unless for macro reasons that's impossible? In any case, I found it plenty readable since the function bodies are all quite short.
Thanks for the iterations on this?
tokio/src/runtime/metrics/batch.rs
Outdated
| unstable: { | ||
| /// The worker is about to park. | ||
| pub(crate) fn about_to_park(&mut self) { | ||
| #[cfg(tokio_unstable)] |
There was a problem hiding this comment.
don't need a double config here
There was a problem hiding this comment.
Thanks & removed double cfg
tokio/src/runtime/metrics/batch.rs
Outdated
| unstable: { | ||
| /// Stop polling an individual task | ||
| pub(crate) fn end_poll(&mut self) { | ||
| #[cfg(tokio_unstable)] |
There was a problem hiding this comment.
Thanks & removed double cfg
| } | ||
| } | ||
|
|
||
| macro_rules! cfg_metrics_variant { |
There was a problem hiding this comment.
Added. Thanks
…tion' into stabilize_worker_total_busy_duration
|
LGTM, thanks! |
|
Thanks for doing this, and thanks @rcoh for the review. It isn't great that stabilizing a metric takes a change whose diff is 600+ lines long. Why is the change so large? I'm not saying there's a problem with your PR; if our codebase makes it difficult, then that's on us, but I'd like to understand the reason. |
|
minimal stabilizations are usually OK. The problem with this (and the
reason it ended up so large) is that it was the first stabilization of a
component of the metrics system inside the workers.
Previously, on stable, the entire struct was subbed with a mock/stub and
the actual impl was only enabled on unstable. This metric was the first
metric producing this to be enabled.
There are a couple of ways that we could make this smaller:
- Accept that the metrics will be written to, even on stable (potential
performance cost)
- allow more dead code to exist (which I think would mitigate the level of
cfg gating required)
Ultimately, the heavily nested module structure where everything is a layer
of re-exports (not sure why this is, some historical reason perhaps?)
definitely makes stabilizing these metrics more of a hassle
…On Sat, Feb 15, 2025 at 5:01 PM Alice Ryhl ***@***.***> wrote:
Thanks for doing this, and thanks @rcoh <https://github.com/rcoh> for the
review.
It isn't great that stabilizing a metric takes a change whose diff is 600+
lines long. Why is the change so large? I'm not saying there's a problem
with your PR; if our codebase makes it difficult, then that's on us, but
I'd like to understand the reason.
cc @Noah-Kennedy <https://github.com/Noah-Kennedy> @hds
<https://github.com/hds>
—
Reply to this email directly, view it on GitHub
<#6899 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AADYKZ6NPMU2QSDNSTXFHOT2P62KDAVCNFSM6AAAAABPYEOHG6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDMNRRGEZDCOJXGU>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
[image: Darksonn]*Darksonn* left a comment (tokio-rs/tokio#6899)
<#6899 (comment)>
Thanks for doing this, and thanks @rcoh <https://github.com/rcoh> for the
review.
It isn't great that stabilizing a metric takes a change whose diff is 600+
lines long. Why is the change so large? I'm not saying there's a problem
with your PR; if our codebase makes it difficult, then that's on us, but
I'd like to understand the reason.
cc @Noah-Kennedy <https://github.com/Noah-Kennedy> @hds
<https://github.com/hds>
—
Reply to this email directly, view it on GitHub
<#6899 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AADYKZ6NPMU2QSDNSTXFHOT2P62KDAVCNFSM6AAAAABPYEOHG6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDMNRRGEZDCOJXGU>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
|
@Darksonn As @rcoh mentioned, it's partly due to this being the first worker metric to be stabilized. Also because I asked the authors to not use I'm not much of a macros person, but I do think that the |
hds
left a comment
There was a problem hiding this comment.
Looks good to me, just need to get the FreeBSD tests working.
Thank you for all your work on this one!
|
Ok, thanks, that's fine then. The FreeBSD tests are already fixed on master. |
Shall we merge this PR ? I'd like to file a few more PRs to stabilize more metrics after this |
Bumps tokio from 1.44.2 to 1.45.0. Release notes Sourced from tokio's releases. Tokio v1.45.0 Added metrics: stabilize worker_total_busy_duration, worker_park_count, and worker_unpark_count (#6899, #7276) process: add Command::spawn_with (#7249) Changed io: do not require Unpin for some trait impls (#7204) rt: mark runtime::Handle as unwind safe (#7230) time: revert internal sharding implementation (#7226) Unstable rt: remove alt multi-threaded runtime (#7275) #6899: tokio-rs/tokio#6899 #7276: tokio-rs/tokio#7276 #7249: tokio-rs/tokio#7249 #7204: tokio-rs/tokio#7204 #7230: tokio-rs/tokio#7230 #7226: tokio-rs/tokio#7226 #7275: tokio-rs/tokio#7275 Commits 00754c8 chore: prepare Tokio v1.45.0 (#7308) 1ae9434 time: revert "use sharding for timer implementation" related changes (#7226) 8895bba ci: Test AArch64 Windows (#7288) 48ca254 time: update sleep documentation to reflect maximum allowed duration (#7302) a0af02a compat: add more documentation to tokio_util::compat (#7279) 0ce3a11 metrics: stabilize worker_park_count and worker_unpark_count (#7276) 1ea9ce1 ci: fix cfg!(miri) declarations in tests (#7286) 4d4d126 chore: prepare tokio-util v0.7.15 (#7283) 5490267 fs: update the mockall dev dependency to 0.13.0 (#7234) 1434b32 examples: improve echo example consistency (#7256) Additional commits viewable in compare view Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase. Dependabot commands and options You can trigger Dependabot actions by commenting on this PR: @dependabot rebase will rebase this PR @dependabot recreate will recreate this PR, overwriting any edits that have been made to it @dependabot merge will merge this PR after your CI passes on it @dependabot squash and merge will squash and merge this PR after your CI passes on it @dependabot cancel merge will cancel a previously requested merge and block automerging @dependabot reopen will reopen this PR if it is closed @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually @dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
Motivation
Stabilize worker_total_busy_duration so it can be used outside of --cfg tokio_unstable
Solution
Move the API impl out of tokio_unstable flag
Ref: #6546