-
Notifications
You must be signed in to change notification settings - Fork 853
WIP: new metrics emitter and histogram strategy #7201
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
t.metrics.replicationLag( | ||
t.ackLevels.UpdateIfNeededAndGetQueueMaxReadLevel(persistence.HistoryTaskCategoryReplication, pollingCluster), | ||
taskInfos, | ||
msgs, | ||
t.scope, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've done it this way mostly because:
- the metrics-func is private and local to this file, so it's not too concerning to have it depend on excessively-large objects
- each argument is a unique type, so it's not possible to pass them wrong (the main alternative is to pass three integers).
common/metrics/structured/base.go
Outdated
var Module = fx.Provide(func(s tally.Scope) Emitter { | ||
return Emitter{scope: s} | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is the only constructor, outside tests. makes it fairly likely that I've got my fx / resource / context / etc injection correct.
// "index -> operation" must be unique for structured.DynamicOperationTags' int lookup to work consistently. | ||
// Duplicate indexes with the same operation name are technically fine, but there doesn't seem to be any benefit in allowing it, | ||
// and it trivially ensures that all indexes have only one operation name. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could change this operation-lookup to also require a serviceIdx
, if this change is not safe. But it seems probably-safe and a moderate bit less error-prone?
I have not thoroughly checked it though, so this is mostly just aspirational.
common/metrics/defs_test.go
Outdated
case CacheFullCounter, BaseCacheFullCounter: | ||
checkIgnore(History, Common, CacheFullCounter, BaseCacheFullCounter) | ||
continue | ||
case CacheHitCounter, BaseCacheHit: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we have a bug report / slack message on this one, possibly others
common/metrics/defs_test.go
Outdated
checkIgnore(serviceIdx, serviceIdx, CrossClusterFetchFailures, CrossClusterTaskRespondFailures) | ||
continue | ||
case CadenceRequestsPerTaskList, CadenceRequestsPerTaskListWithoutRollup: | ||
// arguably this one is fine |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this really is "the same metric", it's just that one has a dual-emitted rollup and the other does not. I don't know why, or if they're even both used.
this might still be a source of mismatched tags / prometheus problems, as it implies two locations, but I haven't seen it specifically yet.
cmd/server/go.mod
Outdated
@@ -57,10 +57,10 @@ require ( | |||
go.uber.org/thriftrw v1.29.2 // indirect | |||
go.uber.org/yarpc v1.70.3 // indirect | |||
go.uber.org/zap v1.26.0 | |||
golang.org/x/net v0.38.0 // indirect |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if possible it'd be good to pull the mod upgrades into a separate PR just due to the liklihood of a rollback
@@ -53,6 +53,10 @@ type ( | |||
ServiceIdx int | |||
) | |||
|
|||
func (s scopeDefinition) GetOperationString() string { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mild nit: should this just be String
? I thought that was a somewhat-commonly used go interface
@@ -1068,7 +1072,7 @@ const ( | |||
// -- Operation scopes for History service -- | |||
const ( | |||
// HistoryStartWorkflowExecutionScope tracks StartWorkflowExecution API calls received by service | |||
HistoryStartWorkflowExecutionScope = iota + NumCommonScopes | |||
HistoryStartWorkflowExecutionScope = iota + NumFrontendScopes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
urgh.. yeah, this is way better.
// Histogram records a duration-based histogram with the provided data. | ||
// It adds a "histogram_scale" tag, so histograms can be accurately subset in queries or via middleware. | ||
func (b Emitter) Histogram(name string, buckets SubsettableHistogram, dur time.Duration, meta Metadata) { | ||
tags := make(DynamicTags, meta.NumTags()+1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
q: what's the +1 for?
// scale int | ||
// } | ||
|
||
func (s SubsettableHistogram) subsetTo(newScale int) SubsettableHistogram { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question, possibly needing an explain-like-I'm-5: Under what conditions would we subset? What's it for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see it's used above, but I think I'd prefer just to create the histogram explicitly from scratch - at least it took me like 10 minutes to understand what was going on. I'm still not completely sure what the semantics of newScale
is
|
||
func (s SomethingTags) ItHappened(times int) { | ||
tags := s.GetTags() // get all static tags | ||
tags["reserved"] = fmt.Sprint(rand.Intn(10)) // add the reserved one(s) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume you mean to give an exmaple of falling into 1 of 10 buckets, it's a little confusing at first, I'd just used a fixed value or domain-parameter to reduce confusion and/or make it a bit more concrete.
The use of the word 'reserved' is conceptually overloaded with reserved keywords. Or alternatively... I admit I don't follow the intent of the tag here
|
||
flag.BoolVar(&VERBOSE, "v", false, "verbose output, e.g. print all types found") | ||
|
||
log.SetFlags(log.Lshortfile) // TODO: I really can't stand this log package, replace? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
imho zap is a fine standard.
continue // empty lines are fine | ||
} | ||
words := strings.Fields(line) | ||
if len(words) == 3 && words[1] == "success:" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is uh... not a super fun way to parse subprocesses' output... I guess exit codes don't work, we have no other options?
) | ||
// emit the number of replication tasks | ||
mScope.IncCounter(metrics.ReplicationTasksAppliedPerDomain) | ||
p.perDomainTaskMetrics.taskProcessed(scope, domainName, startTime, replicationTask, mScope) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remark: this is quite a nice demonstration of the encapsulation that this model provides, it's a good deal easier to maintain imho
structured.DynamicOperationTags | ||
|
||
TargetCluster string `tag:"target_cluster"` | ||
Domain struct{} `tag:"domain"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: why is this a struct?
// Default1ms10m is our "default" set of buckets, targeting 1ms through 100s, | ||
// and is "rounded up" slightly to reach 80 buckets == 16 minutes (100s needs 68 buckets), | ||
// plus multi-minute exceptions are common enough to support for the small additional cost. | ||
// |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit confused here, it's called 1msto10min, but it's targeting to 100s (~1.5m) and it supports up to 16 minutes.
// scale int | ||
// } | ||
|
||
func (s SubsettableHistogram) subsetTo(newScale int) SubsettableHistogram { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see it's used above, but I think I'd prefer just to create the histogram explicitly from scratch - at least it took me like 10 minutes to understand what was going on. I'm still not completely sure what the semantics of newScale
is
cmd := exec.Command(os.Args[0], append([]string{"-analyze"}, os.Args[1:]...)...) | ||
out, _ := cmd.CombinedOutput() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we run thd analyzer as a function instead of running a subprocess and then pasing the output?
|
||
// all metrics tags are dynamic per task and cannot be filled in up-front. | ||
// | ||
// skip:Convenience unable to use ad-hoc due to dynamic values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I got the intention of skip from reading the code, but it's not clear at all here what it means - it should be skip_generation
or something like that
func (p perDomainTaskMetricTags) taskProcessed(operation int, domain string, processingStart time.Time, task *types.ReplicationTask, legacyScope metrics.Scope) { | ||
tags := p.GetTags(operation) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we plan to keep the operation ints? Are they not basically just pointers to strings?
77eec8b
to
c30434b
Compare
c30434b
to
2383e5d
Compare
Proof of concept on display.
This is an attempt to move away from our opaque and broadly-disliked metrics system, and towards two major changes:
Once this runs in prod for a bit, so I have more data to play with, I'll build alerts and dashboards based on the new data, and we can check the runtime cost of these new metrics.
I've selected moderately-used and rather-expensive (high cardinality, high number of calls) metrics for this first one, to try to give us a realistic sample.