WIP: new metrics emitter and histogram strategy #7201

Groxx · 2025-08-22T03:14:58Z

Proof of concept on display.

This is an attempt to move away from our opaque and broadly-disliked metrics system, and towards two major changes:

use structs to show roughly what tags are available at all times, to make our metrics-emitting easier to understand and safer to change (new fields == build failures until fixed)
emit histograms rather than timers (this is just step 1 of like 1000+, to show the core tactic)
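
As a rough illustration of the first point, here is a minimal sketch of deriving a tag map from a struct's fields. The `replicationTags` type and the `getTags` helper are hypothetical, and reflection is used only to keep the sketch short; it does not reproduce the build-failure property described above, and it is not the PR's implementation.

```go
package main

import (
	"fmt"
	"reflect"
)

// replicationTags is a hypothetical tag struct: the fields document which
// tags a metric carries, and the `tag` struct tag gives the emitted name.
type replicationTags struct {
	TargetCluster string `tag:"target_cluster"`
	Domain        string `tag:"domain"`
}

// getTags collects every string field that has a `tag` struct tag into a map.
// Fields without the tag, or with a non-string type, are skipped.
func getTags(v any) map[string]string {
	out := map[string]string{}
	rv := reflect.ValueOf(v)
	rt := rv.Type()
	for i := 0; i < rt.NumField(); i++ {
		field := rt.Field(i)
		name, ok := field.Tag.Lookup("tag")
		if !ok || field.Type.Kind() != reflect.String {
			continue
		}
		out[name] = rv.Field(i).String()
	}
	return out
}

func main() {
	tags := getTags(replicationTags{TargetCluster: "cluster-b", Domain: "example-domain"})
	fmt.Println(tags)
}
```

Reading the struct definition is enough to see which tags a metric can carry, which is the readability benefit the description is after.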

Once this runs in prod for a bit, so I have more data to play with, I'll build alerts and dashboards based on the new data, and we can check the runtime cost of these new metrics.
I've selected moderately-used and rather-expensive (high cardinality, high number of calls) metrics for this first one, to try to give us a realistic sample.

Groxx · 2025-08-22T03:17:41Z

service/history/replication/task_ack_manager.go

+	t.metrics.replicationLag(
+		t.ackLevels.UpdateIfNeededAndGetQueueMaxReadLevel(persistence.HistoryTaskCategoryReplication, pollingCluster),
+		taskInfos,
+		msgs,
+		t.scope,
+	)


I've done it this way mostly because:

the metrics-func is private and local to this file, so it's not too concerning to have it depend on excessively-large objects

each argument is a unique type, so it's not possible to pass them wrong (the main alternative is to pass three integers).
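
To make the second point concrete, here is a minimal sketch with hypothetical stand-in types (`maxReadLevel`, `taskInfo`, and `message` are assumptions, not the PR's actual types). Because every parameter has a distinct type, swapping two arguments is a compile error, whereas three `int` parameters would accept any order.

```go
package main

import "fmt"

// Hypothetical stand-ins for the real arguments; the names are assumptions.
type maxReadLevel int64
type taskInfo struct{ taskID int64 }
type message struct{ sourceTaskID int64 }

// replicationLag returns how far the last returned message trails the max
// read level, plus how many fetched tasks produced no message.
func replicationLag(level maxReadLevel, infos []taskInfo, msgs []message) (lag int64, dropped int) {
	dropped = len(infos) - len(msgs)
	if len(msgs) == 0 {
		return int64(level), dropped
	}
	last := msgs[len(msgs)-1]
	return int64(level) - last.sourceTaskID, dropped
}

func main() {
	infos := []taskInfo{{taskID: 7}, {taskID: 9}}
	msgs := []message{{sourceTaskID: 7}, {sourceTaskID: 9}}
	lag, dropped := replicationLag(maxReadLevel(12), infos, msgs)
	fmt.Println(lag, dropped)
	// replicationLag(msgs, infos, maxReadLevel(12)) // does not compile: argument types mismatch
}
```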

Groxx · 2025-08-22T03:21:28Z

common/metrics/structured/base.go

+var Module = fx.Provide(func(s tally.Scope) Emitter {
+	return Emitter{scope: s}
+})


this is the only constructor, outside tests. makes it fairly likely that I've got my fx / resource / context / etc injection correct.

Groxx · 2025-08-26T02:08:26Z

common/metrics/defs_test.go

+// "index -> operation" must be unique for structured.DynamicOperationTags' int lookup to work consistently.
+// Duplicate indexes with the same operation name are technically fine, but there doesn't seem to be any benefit in allowing it,
+// and it trivially ensures that all indexes have only one operation name.


I could change this operation-lookup to also require a serviceIdx, if this change is not safe. But it seems probably-safe and a moderate bit less error-prone?
I have not thoroughly checked it though, so this is mostly just aspirational.
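
For reference, the uniqueness check described in the code comment could look roughly like the sketch below. The `duplicateIndexes` helper and the shape of `defs` (service name to index to operation name) are assumptions for illustration, not the test in the PR. It follows the stricter rule from the comment: an index that appears in more than one service is reported even if the operation name matches.

```go
package main

import (
	"fmt"
	"sort"
)

// duplicateIndexes returns, in ascending order, every operation index that
// is defined by more than one service.
func duplicateIndexes(defs map[string]map[int]string) []int {
	seen := map[int]int{} // index -> number of services defining it
	for _, ops := range defs {
		for idx := range ops {
			seen[idx]++
		}
	}
	var dups []int
	for idx, n := range seen {
		if n > 1 {
			dups = append(dups, idx)
		}
	}
	sort.Ints(dups)
	return dups
}

func main() {
	defs := map[string]map[int]string{
		"common":  {0: "OperationA", 1: "OperationB"},
		"history": {1: "OperationB", 2: "OperationC"},
	}
	fmt.Println(duplicateIndexes(defs))
}
```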

Groxx · 2025-08-26T02:11:00Z

common/metrics/defs_test.go

+				case CacheFullCounter, BaseCacheFullCounter:
+					checkIgnore(History, Common, CacheFullCounter, BaseCacheFullCounter)
+					continue
+				case CacheHitCounter, BaseCacheHit:


we have a bug report / slack message on this one, possibly others

Groxx · 2025-08-26T02:12:28Z

common/metrics/defs_test.go

+					checkIgnore(serviceIdx, serviceIdx, CrossClusterFetchFailures, CrossClusterTaskRespondFailures)
+					continue
+				case CadenceRequestsPerTaskList, CadenceRequestsPerTaskListWithoutRollup:
+					// arguably this one is fine


this really is "the same metric", it's just that one has a dual-emitted rollup and the other does not. I don't know why, or if they're even both used.

this might still be a source of mismatched tags / prometheus problems, as it implies two locations, but I haven't seen it specifically yet.

davidporter-id-au · 2025-08-26T02:33:27Z

cmd/server/go.mod

@@ -57,10 +57,10 @@ require (
 	go.uber.org/thriftrw v1.29.2 // indirect
 	go.uber.org/yarpc v1.70.3 // indirect
 	go.uber.org/zap v1.26.0
-	golang.org/x/net v0.38.0 // indirect


if possible it'd be good to pull the mod upgrades into a separate PR just due to the likelihood of a rollback

davidporter-id-au · 2025-08-26T02:34:20Z

common/metrics/defs.go

@@ -53,6 +53,10 @@ type (
 	ServiceIdx int
 )

+func (s scopeDefinition) GetOperationString() string {


mild nit: should this just be String? I thought that was a somewhat-commonly used go interface

davidporter-id-au · 2025-08-26T02:39:02Z

common/metrics/defs.go

@@ -1068,7 +1072,7 @@ const (
 // -- Operation scopes for History service --
 const (
 	// HistoryStartWorkflowExecutionScope tracks StartWorkflowExecution API calls received by service
-	HistoryStartWorkflowExecutionScope = iota + NumCommonScopes
+	HistoryStartWorkflowExecutionScope = iota + NumFrontendScopes


urgh.. yeah, this is way better.

davidporter-id-au · 2025-08-26T02:43:19Z

common/metrics/structured/base.go

+// Histogram records a duration-based histogram with the provided data.
+// It adds a "histogram_scale" tag, so histograms can be accurately subset in queries or via middleware.
+func (b Emitter) Histogram(name string, buckets SubsettableHistogram, dur time.Duration, meta Metadata) {
+	tags := make(DynamicTags, meta.NumTags()+1)


q: what's the +1 for?

davidporter-id-au · 2025-08-26T03:46:54Z

common/metrics/structured/histograms.go

+// 		scale int
+// 	}
+
+func (s SubsettableHistogram) subsetTo(newScale int) SubsettableHistogram {


Question, possibly needing an explain-like-I'm-5: Under what conditions would we subset? What's it for?

I see it's used above, but I think I'd prefer just to create the histogram explicitly from scratch - at least it took me like 10 minutes to understand what was going on. I'm still not completely sure what the semantics of newScale is

davidporter-id-au · 2025-08-26T04:07:12Z

common/metrics/structured/doc.go

+
+	func (s SomethingTags) ItHappened(times int) {
+		tags := s.GetTags()                          // get all static tags
+		tags["reserved"] = fmt.Sprint(rand.Intn(10)) // add the reserved one(s)


I assume you mean to give an example of falling into 1 of 10 buckets. It's a little confusing at first; I'd just use a fixed value or a domain parameter to reduce confusion and/or make it a bit more concrete.

The use of the word 'reserved' is conceptually overloaded with reserved keywords. Or alternatively... I admit I don't follow the intent of the tag here

davidporter-id-au · 2025-08-26T04:21:06Z

internal/tools/metricsgen/main.go

+
+	flag.BoolVar(&VERBOSE, "v", false, "verbose output, e.g. print all types found")
+
+	log.SetFlags(log.Lshortfile) // TODO: I really can't stand this log package, replace?


imho zap is a fine standard.

davidporter-id-au · 2025-08-26T04:26:18Z

internal/tools/metricslint/cmd/main.go

+			continue // empty lines are fine
+		}
+		words := strings.Fields(line)
+		if len(words) == 3 && words[1] == "success:" {


this is uh... not a super fun way to parse subprocesses' output... I guess exit codes don't work, we have no other options?

davidporter-id-au · 2025-08-26T04:31:17Z

service/history/replication/task_processor.go

-		)
-		// emit the number of replication tasks
-		mScope.IncCounter(metrics.ReplicationTasksAppliedPerDomain)
+		p.perDomainTaskMetrics.taskProcessed(scope, domainName, startTime, replicationTask, mScope)


remark: this is quite a nice demonstration of the encapsulation that this model provides, it's a good deal easier to maintain imho

davidporter-id-au · 2025-08-26T04:32:20Z

service/history/replication/task_processor.go

+		structured.DynamicOperationTags
+
+		TargetCluster string   `tag:"target_cluster"`
+		Domain        struct{} `tag:"domain"`


question: why is this a struct?

jakobht · 2025-08-26T09:26:15Z

common/metrics/structured/histograms.go

+	// Default1ms10m is our "default" set of buckets, targeting 1ms through 100s,
+	// and is "rounded up" slightly to reach 80 buckets == 16 minutes (100s needs 68 buckets),
+	// plus multi-minute exceptions are common enough to support for the small additional cost.
+	//


I'm a bit confused here, it's called 1msto10min, but it's targeting to 100s (~1.5m) and it supports up to 16 minutes.

jakobht · 2025-08-26T09:34:44Z

common/metrics/structured/histograms.go

+// 		scale int
+// 	}
+
+func (s SubsettableHistogram) subsetTo(newScale int) SubsettableHistogram {


I see it's used above, but I think I'd prefer just to create the histogram explicitly from scratch - at least it took me like 10 minutes to understand what was going on. I'm still not completely sure what the semantics of newScale is

jakobht · 2025-08-26T11:52:53Z

internal/tools/metricslint/cmd/main.go

+	cmd := exec.Command(os.Args[0], append([]string{"-analyze"}, os.Args[1:]...)...)
+	out, _ := cmd.CombinedOutput()


Can we run the analyzer as a function instead of running a subprocess and then parsing the output?

jakobht · 2025-08-26T11:56:44Z

service/history/replication/task_processor.go

+
+	// all metrics tags are dynamic per task and cannot be filled in up-front.
+	//
+	// skip:Convenience unable to use ad-hoc due to dynamic values


So I got the intention of skip from reading the code, but it's not clear at all here what it means - it should be skip_generation or something like that

jakobht · 2025-08-26T13:15:49Z

service/history/replication/task_processor.go

+func (p perDomainTaskMetricTags) taskProcessed(operation int, domain string, processingStart time.Time, task *types.ReplicationTask, legacyScope metrics.Scope) {
+	tags := p.GetTags(operation)


Do we plan to keep the operation ints? Are they not basically just pointers to strings?

Groxx force-pushed the histogram-experimenting branch from 77eec8b to c30434b · August 27, 2025 21:50

Groxx added 4 commits · August 27, 2025 16:56

026f72f (move to front) change metrics defs to have unique integers
ac984ed base metrics objects
97ede4c use the new metrics thing
2383e5d new tools

Groxx force-pushed the histogram-experimenting branch from c30434b to 2383e5d · September 3, 2025 23:52

