Optimize AddCaller+AddStacktrace by sharing stacks #1052
Adds a benchmark that tests the performance of a Zap logger with both the `AddCaller` and `AddStacktrace` options enabled.
If we use the `AddCaller(true)` option for a logger with
`AddStacktrace(level)`, which captures a stack trace for the specified
level or higher, Zap currently inspects the stack twice:
- `getCallerFrame` retrieves information about the immediate caller of
the log entry
- `StackSkip` calls `takeStacktrace` to capture the entire stack trace
and build a string from it
For the case where both caller and stack trace are requested,
we can avoid redundant work by sharing information about the call stack
between the logic for the two.
To accomplish this, the following high-level pieces were added:
type stacktrace
captureStacktrace(skip int, depth stacktraceDepth) *stacktrace
func (*stacktrace) Next() (runtime.Frame, bool)
type stackFormatter
func newStackFormatter(*buffer.Buffer)
func (*stackFormatter) FormatStack(*stacktrace)
func (*stackFormatter) FormatFrame(runtime.Frame)
`captureStacktrace` captures stack traces (either just one frame or the
entire stack) that can then be inspected manually (to extract caller
information) and formatted into a string representation either
one-frame-at-a-time or wholesale with `stackFormatter`.
This allows us to reorganize the logic that applies the AddCaller and
AddStacktrace options so that it can:
- capture as much of the stack as it needs
- extract caller information from the first frame if we need it
- format that frame and the remaining frames into a stack trace
---
Benchmarking:
I ran the included benchmark before and after the patch with the
following flags and compared the results with benchstat.
```
% go test -run '^$' -bench AddCallerAndStack -benchmem -count 10
[...]
% benchstat before.txt after.txt
name                      old time/op    new time/op    delta
AddCallerAndStacktrace-2  4.68µs ± 7%    3.63µs ± 7%    -22.35%  (p=0.000 n=10+10)

name                      old alloc/op   new alloc/op   delta
AddCallerAndStacktrace-2  632B ± 0%      416B ± 0%      -34.18%  (p=0.000 n=10+10)

name                      old allocs/op  new allocs/op  delta
AddCallerAndStacktrace-2  5.00 ± 0%      3.00 ± 0%      -40.00%  (p=0.000 n=10+10)
```
Allocations for this code path are down from 5 to 3, and CPU time is down by
~20%.
To check for regressions, I also ran all existing benchmarks with
"Caller" or "Stack" in their names:
```
% go test -run '^$' -bench 'Caller|Stack' -benchmem -count 10
[...]
% benchstat before.txt after.txt
name              old time/op    new time/op    delta
StackField-2      3.28µs ± 2%    3.49µs ± 2%    +6.38%   (p=0.000 n=9+9)
AddCallerHook-2   1.89µs ± 2%    1.92µs ± 3%    ~        (p=0.055 n=10+9)
TakeStacktrace-2  3.17µs ± 1%    3.60µs ± 8%    +13.63%  (p=0.000 n=8+9)

name              old alloc/op   new alloc/op   delta
StackField-2      528B ± 0%      528B ± 0%      ~        (all equal)
AddCallerHook-2   248B ± 0%      240B ± 0%      -3.23%   (p=0.000 n=10+10)
TakeStacktrace-2  416B ± 0%      416B ± 0%      ~        (all equal)

name              old allocs/op  new allocs/op  delta
StackField-2      3.00 ± 0%      3.00 ± 0%      ~        (all equal)
AddCallerHook-2   3.00 ± 0%      2.00 ± 0%      -33.33%  (p=0.000 n=10+10)
TakeStacktrace-2  2.00 ± 0%      2.00 ± 0%      ~        (all equal)
```
AddCaller costs one less allocation, but the cost of capturing a stack
trace is *slightly* higher. I think that may be worth it for the 20%+
improvement on Caller+Stack and reduced allocations.
Codecov Report
```
@@            Coverage Diff             @@
##           master    #1052      +/-   ##
==========================================
+ Coverage   98.22%   98.29%   +0.07%
==========================================
  Files          48       48
  Lines        2080     2111      +31
==========================================
+ Hits         2043     2075      +32
+ Misses         29       28       -1
  Partials        8        8
```
prashantv left a comment:
Change looks good. It seems like not reusing CallersFrames actually helps bring performance back to master levels for TakeStacktrace:
#1053
```
} else {
	stack.pcs = stack.pcs[:numFrames]
```
Do we need to branch here in the stackTraceFirst case? The loop will just be a no-op, right? (Or is there a performance difference?)
Responded offline: yes, we need the branch here for when numFrames=0.
The `runtime.CallersFrames` function doesn't do much other than return a struct pointer wrapping the passed-in `pcs`. Calling it twice ends up simplifying some of the logic (not having to format the first frame separately from the remaining ones), and brings the performance of `TakeStacktrace` back in line with `master`:
```
# benchstat comparing 3 runs of master vs this branch, each run using benchtime=10s
name            old time/op    new time/op    delta
TakeStacktrace  1.47µs ± 2%    1.48µs ± 2%    ~  (p=1.000 n=3+3)

name            old alloc/op   new alloc/op   delta
TakeStacktrace  496B ± 0%      496B ± 0%      ~  (all equal)

name            old allocs/op  new allocs/op  delta
TakeStacktrace  2.00 ± 0%      2.00 ± 0%      ~  (all equal)
```
@prashantv I've updated the numbers above after incorporating #1053.
This reverts commit 9cbe14e.
Reverting #1053. It does not appear to be an improvement after all.
sywhang left a comment:
LGTM minus a couple of nits.
stacktrace.go (outdated):
```
pcs []uintptr
```
```
// stackFormatter formats a stack trace into a readable string representation.
type stackFormatter struct {
	i int // number of frames already written
```
It looks like this can be a bool, since we only use it to check whether any frames were already written in FormatFrame.
Replacing with a bool.
Co-authored-by: Sung Yoon Whang <[email protected]>
Track whether we've written at least one frame, and use that to decide if we need the prefix.
(Supersedes #808.)
Refs GO-1181