Groxx
Member

@Groxx Groxx commented Aug 22, 2025

Proof of concept on display.

This is an attempt to move away from our opaque and broadly-disliked metrics system, and towards two major changes:

  • use structs to show which tags are available at roughly all times, making our metrics-emitting easier to understand and safer to change (new fields == build failures until fixed)
  • emit histograms rather than timers (this is just step 1 of like 1000+, to show the core tactic)

Once this runs in prod for a bit, so I have more data to play with, I'll build alerts and dashboards based on the new data, and we can check the runtime cost of these new metrics.
I've selected moderately-used and rather-expensive (high cardinality, high number of calls) metrics for this first one, to try to give us a realistic sample.
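
For readers unfamiliar with the "new fields == build failures" idea, here is a minimal sketch of one way to get that property in plain Go. It assumes a positional constructor; `replicationTags`, `newReplicationTags`, and `asMap` are hypothetical names for illustration, not the types in this PR.

```go
package main

import "fmt"

// replicationTags is a hypothetical tag struct: every tag the metric can
// carry is visible as a field.
type replicationTags struct {
	TargetCluster string
	Domain        string
}

// newReplicationTags takes every field positionally. Adding a field to the
// struct means extending this signature, so every call site fails to build
// until it supplies the new tag.
func newReplicationTags(targetCluster, domain string) replicationTags {
	return replicationTags{TargetCluster: targetCluster, Domain: domain}
}

// asMap flattens the struct into the map form most metrics clients accept.
func (t replicationTags) asMap() map[string]string {
	return map[string]string{
		"target_cluster": t.TargetCluster,
		"domain":         t.Domain,
	}
}

func main() {
	tags := newReplicationTags("cluster-b", "example-domain")
	fmt.Println(tags.asMap())
}
```

A keyed struct literal would not give the same guarantee, since Go silently zero-fills omitted fields; that is why the sketch routes construction through one function.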

Comment on lines 228 to 233
t.metrics.replicationLag(
	t.ackLevels.UpdateIfNeededAndGetQueueMaxReadLevel(persistence.HistoryTaskCategoryReplication, pollingCluster),
	taskInfos,
	msgs,
	t.scope,
)
Member Author

I've done it this way mostly because:

  1. the metrics-func is private and local to this file, so it's not too concerning to have it depend on excessively-large objects
  2. each argument is a unique type, so it's not possible to pass them wrong (the main alternative is to pass three integers).
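
A small sketch of point 2, under the assumption that the alternative really would be several same-typed integers. The types `maxReadLevel` and `lastTaskID` and the function `lag` are hypothetical and only illustrate the compile-time check.

```go
package main

import "fmt"

// Hypothetical distinct types standing in for the differently-typed
// arguments; the real call passes an ack level, task infos, and messages.
type maxReadLevel int64
type lastTaskID int64

// lag reports how far the last read task trails the queue's max read level.
func lag(readLevel maxReadLevel, last lastTaskID) int64 {
	return int64(readLevel) - int64(last)
}

func main() {
	readLevel, last := maxReadLevel(120), lastTaskID(100)
	fmt.Println(lag(readLevel, last))
	// lag(last, readLevel) does not compile: the typed values cannot be
	// swapped, whereas two plain int64 arguments could be.
}
```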

Comment on lines 12 to 14
var Module = fx.Provide(func(s tally.Scope) Emitter {
	return Emitter{scope: s}
})
Member Author

this is the only constructor, outside tests. makes it fairly likely that I've got my fx / resource / context / etc injection correct.

Comment on lines +132 to +134
// "index -> operation" must be unique for structured.DynamicOperationTags' int lookup to work consistently.
// Duplicate indexes with the same operation name are technically fine, but there doesn't seem to be any benefit in allowing it,
// and it trivially ensures that all indexes have only one operation name.
Member Author

@Groxx Groxx Aug 26, 2025

I could change this operation-lookup to also require a serviceIdx, if this change is not safe. But it seems probably-safe and a moderate bit less error-prone?
I have not thoroughly checked it though, so this is mostly just aspirational.
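
To make the uniqueness rule concrete, here is a rough sketch of an "index -> operation" table that rejects any repeated index, including repeats with the same name. `opDef` and `buildLookup` are hypothetical; the actual check in this PR may be structured differently.

```go
package main

import "fmt"

// opDef is a hypothetical (index, operation name) pair, standing in for an
// entry in the scope definitions.
type opDef struct {
	idx  int
	name string
}

// buildLookup returns an index -> operation table and fails on any repeated
// index, so each int can only ever resolve to one operation name.
func buildLookup(defs []opDef) (map[int]string, error) {
	lookup := make(map[int]string, len(defs))
	for _, d := range defs {
		if prev, ok := lookup[d.idx]; ok {
			return nil, fmt.Errorf("index %d used by both %q and %q", d.idx, prev, d.name)
		}
		lookup[d.idx] = d.name
	}
	return lookup, nil
}

func main() {
	_, err := buildLookup([]opDef{{0, "OpA"}, {1, "OpB"}, {1, "OpC"}})
	fmt.Println(err)
}
```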

case CacheFullCounter, BaseCacheFullCounter:
checkIgnore(History, Common, CacheFullCounter, BaseCacheFullCounter)
continue
case CacheHitCounter, BaseCacheHit:
Member Author

@Groxx Groxx Aug 26, 2025

we have a bug report / slack message on this one, possibly others

checkIgnore(serviceIdx, serviceIdx, CrossClusterFetchFailures, CrossClusterTaskRespondFailures)
continue
case CadenceRequestsPerTaskList, CadenceRequestsPerTaskListWithoutRollup:
// arguably this one is fine
Member Author

@Groxx Groxx Aug 26, 2025

this really is "the same metric", it's just that one has a dual-emitted rollup and the other does not. I don't know why, or if they're even both used.

this might still be a source of mismatched tags / prometheus problems, as it implies two locations, but I haven't seen it specifically yet.

@@ -57,10 +57,10 @@ require (
go.uber.org/thriftrw v1.29.2 // indirect
go.uber.org/yarpc v1.70.3 // indirect
go.uber.org/zap v1.26.0
golang.org/x/net v0.38.0 // indirect
Member

if possible it'd be good to pull the mod upgrades into a separate PR, just due to the likelihood of a rollback

@@ -53,6 +53,10 @@ type (
ServiceIdx int
)

func (s scopeDefinition) GetOperationString() string {
Member

mild nit: should this just be String? I thought that was a somewhat-commonly used go interface

@@ -1068,7 +1072,7 @@ const (
// -- Operation scopes for History service --
const (
// HistoryStartWorkflowExecutionScope tracks StartWorkflowExecution API calls received by service
-HistoryStartWorkflowExecutionScope = iota + NumCommonScopes
+HistoryStartWorkflowExecutionScope = iota + NumFrontendScopes
Member

urgh.. yeah, this is way better.
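
A toy illustration of why the offset matters, assuming each `Num...Scopes` constant closes its own `iota` block (the constant names below other than `NumCommonScopes` and `NumFrontendScopes` are hypothetical). Chaining each block off the previous block's count keeps every index unique across services.

```go
package main

import "fmt"

const (
	CommonScopeA = iota // 0
	CommonScopeB        // 1
	NumCommonScopes     // 2
)

const (
	FrontendScopeA = iota + NumCommonScopes // 2
	FrontendScopeB                          // 3
	NumFrontendScopes                       // 4
)

const (
	// Offsetting by NumFrontendScopes starts History at 4. Offsetting by
	// NumCommonScopes instead would start it at 2 and collide with Frontend.
	HistoryScopeA = iota + NumFrontendScopes // 4
	HistoryScopeB                            // 5
)

func main() {
	fmt.Println(CommonScopeA, FrontendScopeA, HistoryScopeA)
}
```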

// Histogram records a duration-based histogram with the provided data.
// It adds a "histogram_scale" tag, so histograms can be accurately subset in queries or via middleware.
func (b Emitter) Histogram(name string, buckets SubsettableHistogram, dur time.Duration, meta Metadata) {
tags := make(DynamicTags, meta.NumTags()+1)
Member

q: what's the +1 for?
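
A guess at the answer, not confirmed by the author: the doc comment says the method adds a "histogram_scale" tag, so the +1 presumably reserves room for that one extra entry. The sketch below assumes `DynamicTags` is a map type, in which case the second argument to `make` is only a capacity hint; `withScale` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strconv"
)

// withScale copies the caller's tags and adds one more, sizing the map up
// front so the extra entry does not force a regrow.
func withScale(base map[string]string, scale int) map[string]string {
	tags := make(map[string]string, len(base)+1) // +1 for histogram_scale
	for k, v := range base {
		tags[k] = v
	}
	tags["histogram_scale"] = strconv.Itoa(scale)
	return tags
}

func main() {
	fmt.Println(withScale(map[string]string{"domain": "example"}, 2))
}
```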

// scale int
// }

func (s SubsettableHistogram) subsetTo(newScale int) SubsettableHistogram {
Member

Question, possibly needing an explain-like-I'm-5: Under what conditions would we subset? What's it for?

Member

I see it's used above, but I think I'd prefer just to create the histogram explicitly from scratch - at least it took me like 10 minutes to understand what was going on. I'm still not completely sure what the semantics of newScale is


func (s SomethingTags) ItHappened(times int) {
tags := s.GetTags() // get all static tags
tags["reserved"] = fmt.Sprint(rand.Intn(10)) // add the reserved one(s)
Member

@davidporter-id-au davidporter-id-au Aug 26, 2025

I assume you mean to give an example of falling into 1 of 10 buckets. It's a little confusing at first; I'd just use a fixed value or a domain parameter to reduce confusion and/or make it a bit more concrete.

The use of the word 'reserved' is conceptually overloaded with reserved keywords. Or alternatively... I admit I don't follow the intent of the tag here


flag.BoolVar(&VERBOSE, "v", false, "verbose output, e.g. print all types found")

log.SetFlags(log.Lshortfile) // TODO: I really can't stand this log package, replace?
Member

imho zap is a fine standard.

continue // empty lines are fine
}
words := strings.Fields(line)
if len(words) == 3 && words[1] == "success:" {
Member

this is uh... not a super fun way to parse subprocesses' output... I guess exit codes don't work, we have no other options?

)
// emit the number of replication tasks
mScope.IncCounter(metrics.ReplicationTasksAppliedPerDomain)
p.perDomainTaskMetrics.taskProcessed(scope, domainName, startTime, replicationTask, mScope)
Member

remark: this is quite a nice demonstration of the encapsulation that this model provides, it's a good deal easier to maintain imho

structured.DynamicOperationTags

TargetCluster string `tag:"target_cluster"`
Domain struct{} `tag:"domain"`
Member

question: why is this a struct?
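
One plausible reading, stated as an assumption: a `struct{}` field occupies no memory and holds no value, so it can declare that a tag name exists without storing anything, leaving the value to be supplied at emit time. The sketch below shows how such fields could be told apart with reflection; `exampleTags` and `splitTags` are hypothetical.

```go
package main

import (
	"fmt"
	"reflect"
)

// exampleTags mirrors the shape in the diff: one static tag with a value,
// one zero-size placeholder that only reserves a tag name.
type exampleTags struct {
	TargetCluster string   `tag:"target_cluster"`
	Domain        struct{} `tag:"domain"`
}

// splitTags returns the static tag values plus the names reserved for
// values that must be filled in per call.
func splitTags(v any) (static map[string]string, reserved []string) {
	static = map[string]string{}
	rv := reflect.ValueOf(v)
	rt := rv.Type()
	for i := 0; i < rt.NumField(); i++ {
		field := rt.Field(i)
		name := field.Tag.Get("tag")
		if name == "" {
			continue // untagged fields are not metric tags
		}
		switch {
		case field.Type.Kind() == reflect.String:
			static[name] = rv.Field(i).String()
		case field.Type.Kind() == reflect.Struct && field.Type.NumField() == 0:
			reserved = append(reserved, name)
		}
	}
	return static, reserved
}

func main() {
	static, reserved := splitTags(exampleTags{TargetCluster: "cluster-b"})
	fmt.Println(static, reserved)
}
```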

Comment on lines +22 to +25
// Default1ms10m is our "default" set of buckets, targeting 1ms through 100s,
// and is "rounded up" slightly to reach 80 buckets == 16 minutes (100s needs 68 buckets),
// plus multi-minute exceptions are common enough to support for the small additional cost.
//
Member

I'm a bit confused here: it's called 1msto10min, but it's targeting 100s (~1.7m) and it supports up to 16 minutes.

Comment on lines +29 to +44
cmd := exec.Command(os.Args[0], append([]string{"-analyze"}, os.Args[1:]...)...)
out, _ := cmd.CombinedOutput()
Member

Can we run the analyzer as a function instead of running a subprocess and then parsing the output?


// all metrics tags are dynamic per task and cannot be filled in up-front.
//
// skip:Convenience unable to use ad-hoc due to dynamic values
Member

So I got the intention of skip from reading the code, but it's not clear at all here what it means - it should be skip_generation or something like that

Comment on lines +125 to +126
func (p perDomainTaskMetricTags) taskProcessed(operation int, domain string, processingStart time.Time, task *types.ReplicationTask, legacyScope metrics.Scope) {
tags := p.GetTags(operation)
Member

Do we plan to keep the operation ints? Are they not basically just pointers to strings?

@Groxx Groxx force-pushed the histogram-experimenting branch from 77eec8b to c30434b Compare August 27, 2025 21:50
@Groxx Groxx force-pushed the histogram-experimenting branch from c30434b to 2383e5d Compare September 3, 2025 23:52
3 participants