Skip to content

Introduced trace post-processing #6800

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Apr 16, 2024
Merged

Conversation

ValentinZakharov
Copy link
Contributor

@ValentinZakharov ValentinZakharov commented Mar 12, 2024

What Does This Do

Introduced the Trace Post-Processing - experimental feature.
It allows to process trace at very late stage (in dd-trace-processor thread), before serialization and dispatching to Agent.
The trace processor, check if there is spans that requires post-processing and perform SpanPostProcessor for each of them with timeout. Timeout used for case, when post-processing takes too long and need to be dropped.
The post-processing happens only for span context marked as requiresPostProcessing.

post-processing

Also added extra configuration options:

  • trace.post-processing.timeout - to set timeout for post-processing thread (default: 1000ms)

Motivation

The idea is to have mechanism that can safely do extra computations without bringing overhead in requests latency.
This is preparation for new API Security sampling, where heavy schema extraction should happen after the request finished when we have information about span sampling priority. We intent to compute schema only if span will be retained.

Additional Notes

@ValentinZakharov ValentinZakharov added the comp: asm waf Application Security Management (WAF) label Mar 12, 2024
@ValentinZakharov ValentinZakharov self-assigned this Mar 12, 2024
@ValentinZakharov ValentinZakharov force-pushed the vzakharov/post-processing branch from c340d45 to adf13d0 Compare March 12, 2024 00:48
@pr-commenter
Copy link

pr-commenter bot commented Mar 12, 2024

Benchmarks

Startup

Parameters

Baseline Candidate
baseline_or_candidate baseline candidate
git_branch master vzakharov/post-processing
git_commit_date 1713289990 1713294465
git_commit_sha 90a6e41 ae1ecea
release_version 1.33.0-SNAPSHOT~90a6e4126b 1.33.0-SNAPSHOT~ae1ecea607
See matching parameters
Baseline Candidate
application insecure-bank insecure-bank
ci_job_date 1713298013 1713298013
ci_job_id 488444647 488444647
ci_pipeline_id 32245015 32245015
cpu_model Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
module Agent Agent
parent None None
variant iast iast

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 49 metrics, 14 unstable metrics.

Startup time reports for petclinic
gantt
    title petclinic - global startup overhead: candidate=1.33.0-SNAPSHOT~ae1ecea607, baseline=1.33.0-SNAPSHOT~90a6e4126b

    dateFormat X
    axisFormat %s
section tracing
Agent [baseline] (1.076 s) : 0, 1075677
Total [baseline] (10.386 s) : 0, 10385832
Agent [candidate] (1.082 s) : 0, 1081685
Total [candidate] (10.41 s) : 0, 10409678
section appsec
Agent [baseline] (1.205 s) : 0, 1204729
Total [baseline] (10.511 s) : 0, 10511249
Agent [candidate] (1.205 s) : 0, 1204655
Total [candidate] (10.435 s) : 0, 10435305
section iast
Agent [baseline] (1.199 s) : 0, 1198556
Total [baseline] (10.707 s) : 0, 10707283
Agent [candidate] (1.198 s) : 0, 1197673
Total [candidate] (10.837 s) : 0, 10837312
section profiling
Agent [baseline] (1.266 s) : 0, 1265901
Total [baseline] (10.624 s) : 0, 10624017
Agent [candidate] (1.268 s) : 0, 1268075
Total [candidate] (10.656 s) : 0, 10655848
Loading
  • baseline results
Module Variant Duration Δ tracing
Agent tracing 1.076 s -
Agent appsec 1.205 s 129.052 ms (12.0%)
Agent iast 1.199 s 122.879 ms (11.4%)
Agent profiling 1.266 s 190.224 ms (17.7%)
Total tracing 10.386 s -
Total appsec 10.511 s 125.418 ms (1.2%)
Total iast 10.707 s 321.451 ms (3.1%)
Total profiling 10.624 s 238.185 ms (2.3%)
  • candidate results
Module Variant Duration Δ tracing
Agent tracing 1.082 s -
Agent appsec 1.205 s 122.97 ms (11.4%)
Agent iast 1.198 s 115.988 ms (10.7%)
Agent profiling 1.268 s 186.39 ms (17.2%)
Total tracing 10.41 s -
Total appsec 10.435 s 25.628 ms (0.2%)
Total iast 10.837 s 427.634 ms (4.1%)
Total profiling 10.656 s 246.17 ms (2.4%)
gantt
    title petclinic - break down per module: candidate=1.33.0-SNAPSHOT~ae1ecea607, baseline=1.33.0-SNAPSHOT~90a6e4126b

    dateFormat X
    axisFormat %s
section tracing
BytebuddyAgent [baseline] (673.774 ms) : 0, 673774
BytebuddyAgent [candidate] (678.246 ms) : 0, 678246
GlobalTracer [baseline] (309.564 ms) : 0, 309564
GlobalTracer [candidate] (311.037 ms) : 0, 311037
AppSec [baseline] (49.642 ms) : 0, 49642
AppSec [candidate] (49.58 ms) : 0, 49580
Remote Config [baseline] (666.389 µs) : 0, 666
Remote Config [candidate] (655.047 µs) : 0, 655
Telemetry [baseline] (7.614 ms) : 0, 7614
Telemetry [candidate] (7.534 ms) : 0, 7534
section appsec
BytebuddyAgent [baseline] (699.616 ms) : 0, 699616
BytebuddyAgent [candidate] (699.182 ms) : 0, 699182
GlobalTracer [baseline] (292.955 ms) : 0, 292955
GlobalTracer [candidate] (292.925 ms) : 0, 292925
AppSec [baseline] (150.216 ms) : 0, 150216
AppSec [candidate] (149.764 ms) : 0, 149764
IAST [baseline] (19.177 ms) : 0, 19177
IAST [candidate] (18.868 ms) : 0, 18868
Remote Config [baseline] (611.363 µs) : 0, 611
Remote Config [candidate] (610.55 µs) : 0, 611
Telemetry [baseline] (7.489 ms) : 0, 7489
Telemetry [candidate] (8.735 ms) : 0, 8735
section iast
BytebuddyAgent [baseline] (794.453 ms) : 0, 794453
BytebuddyAgent [candidate] (793.569 ms) : 0, 793569
GlobalTracer [baseline] (287.88 ms) : 0, 287880
GlobalTracer [candidate] (288.583 ms) : 0, 288583
AppSec [baseline] (51.607 ms) : 0, 51607
AppSec [candidate] (49.523 ms) : 0, 49523
IAST [baseline] (22.441 ms) : 0, 22441
IAST [candidate] (22.822 ms) : 0, 22822
Remote Config [baseline] (561.653 µs) : 0, 562
Remote Config [candidate] (591.211 µs) : 0, 591
Telemetry [baseline] (7.32 ms) : 0, 7320
Telemetry [candidate] (8.234 ms) : 0, 8234
section profiling
BytebuddyAgent [baseline] (676.142 ms) : 0, 676142
BytebuddyAgent [candidate] (676.922 ms) : 0, 676922
GlobalTracer [baseline] (379.506 ms) : 0, 379506
GlobalTracer [candidate] (381.176 ms) : 0, 381176
AppSec [baseline] (50.116 ms) : 0, 50116
AppSec [candidate] (50.104 ms) : 0, 50104
Remote Config [baseline] (723.43 µs) : 0, 723
Remote Config [candidate] (706.439 µs) : 0, 706
Telemetry [baseline] (7.427 ms) : 0, 7427
Telemetry [candidate] (7.483 ms) : 0, 7483
ProfilingAgent [baseline] (95.706 ms) : 0, 95706
ProfilingAgent [candidate] (95.438 ms) : 0, 95438
Profiling [baseline] (95.73 ms) : 0, 95730
Profiling [candidate] (95.462 ms) : 0, 95462
Loading
Startup time reports for insecure-bank
gantt
    title insecure-bank - global startup overhead: candidate=1.33.0-SNAPSHOT~ae1ecea607, baseline=1.33.0-SNAPSHOT~90a6e4126b

    dateFormat X
    axisFormat %s
section tracing
Agent [baseline] (1.081 s) : 0, 1081499
Total [baseline] (8.559 s) : 0, 8558976
Agent [candidate] (1.099 s) : 0, 1099098
Total [candidate] (8.639 s) : 0, 8639062
section iast
Agent [baseline] (1.196 s) : 0, 1196121
Total [baseline] (9.001 s) : 0, 9000551
Agent [candidate] (1.2 s) : 0, 1199857
Total [candidate] (9.005 s) : 0, 9005325
section iast_HARDCODED_SECRET_DISABLED
Agent [baseline] (1.206 s) : 0, 1206467
Total [baseline] (8.978 s) : 0, 8977733
Agent [candidate] (1.2 s) : 0, 1199716
Total [candidate] (8.98 s) : 0, 8980289
section iast_TELEMETRY_OFF
Agent [baseline] (1.196 s) : 0, 1196195
Total [baseline] (9.012 s) : 0, 9011770
Agent [candidate] (1.207 s) : 0, 1206999
Total [candidate] (9.067 s) : 0, 9067034
Loading
  • baseline results
Module Variant Duration Δ tracing
Agent tracing 1.081 s -
Agent iast 1.196 s 114.623 ms (10.6%)
Agent iast_HARDCODED_SECRET_DISABLED 1.206 s 124.969 ms (11.6%)
Agent iast_TELEMETRY_OFF 1.196 s 114.697 ms (10.6%)
Total tracing 8.559 s -
Total iast 9.001 s 441.575 ms (5.2%)
Total iast_HARDCODED_SECRET_DISABLED 8.978 s 418.757 ms (4.9%)
Total iast_TELEMETRY_OFF 9.012 s 452.794 ms (5.3%)
  • candidate results
Module Variant Duration Δ tracing
Agent tracing 1.099 s -
Agent iast 1.2 s 100.759 ms (9.2%)
Agent iast_HARDCODED_SECRET_DISABLED 1.2 s 100.618 ms (9.2%)
Agent iast_TELEMETRY_OFF 1.207 s 107.901 ms (9.8%)
Total tracing 8.639 s -
Total iast 9.005 s 366.263 ms (4.2%)
Total iast_HARDCODED_SECRET_DISABLED 8.98 s 341.227 ms (3.9%)
Total iast_TELEMETRY_OFF 9.067 s 427.972 ms (5.0%)
gantt
    title insecure-bank - break down per module: candidate=1.33.0-SNAPSHOT~ae1ecea607, baseline=1.33.0-SNAPSHOT~90a6e4126b

    dateFormat X
    axisFormat %s
section tracing
BytebuddyAgent [baseline] (677.732 ms) : 0, 677732
BytebuddyAgent [candidate] (688.508 ms) : 0, 688508
GlobalTracer [baseline] (311.152 ms) : 0, 311152
GlobalTracer [candidate] (316.532 ms) : 0, 316532
AppSec [baseline] (49.781 ms) : 0, 49781
AppSec [candidate] (50.484 ms) : 0, 50484
Remote Config [baseline] (663.623 µs) : 0, 664
Remote Config [candidate] (681.641 µs) : 0, 682
Telemetry [baseline] (7.578 ms) : 0, 7578
Telemetry [candidate] (7.723 ms) : 0, 7723
section iast
BytebuddyAgent [baseline] (792.546 ms) : 0, 792546
BytebuddyAgent [candidate] (794.825 ms) : 0, 794825
GlobalTracer [baseline] (287.101 ms) : 0, 287101
GlobalTracer [candidate] (288.854 ms) : 0, 288854
AppSec [baseline] (48.84 ms) : 0, 48840
AppSec [candidate] (50.179 ms) : 0, 50179
IAST [baseline] (23.916 ms) : 0, 23916
IAST [candidate] (22.138 ms) : 0, 22138
Remote Config [baseline] (569.674 µs) : 0, 570
Remote Config [candidate] (574.756 µs) : 0, 575
Telemetry [baseline] (8.888 ms) : 0, 8888
Telemetry [candidate] (8.924 ms) : 0, 8924
section iast_HARDCODED_SECRET_DISABLED
BytebuddyAgent [baseline] (800.093 ms) : 0, 800093
BytebuddyAgent [candidate] (795.162 ms) : 0, 795162
GlobalTracer [baseline] (289.549 ms) : 0, 289549
GlobalTracer [candidate] (288.548 ms) : 0, 288548
AppSec [baseline] (51.472 ms) : 0, 51472
AppSec [candidate] (50.821 ms) : 0, 50821
IAST [baseline] (22.875 ms) : 0, 22875
IAST [candidate] (21.997 ms) : 0, 21997
Remote Config [baseline] (579.551 µs) : 0, 580
Remote Config [candidate] (612.189 µs) : 0, 612
Telemetry [baseline] (7.331 ms) : 0, 7331
Telemetry [candidate] (8.171 ms) : 0, 8171
section iast_TELEMETRY_OFF
BytebuddyAgent [baseline] (792.158 ms) : 0, 792158
BytebuddyAgent [candidate] (799.567 ms) : 0, 799567
GlobalTracer [baseline] (287.819 ms) : 0, 287819
GlobalTracer [candidate] (291.483 ms) : 0, 291483
AppSec [baseline] (51.962 ms) : 0, 51962
AppSec [candidate] (49.858 ms) : 0, 49858
IAST [baseline] (22.644 ms) : 0, 22644
IAST [candidate] (21.987 ms) : 0, 21987
Remote Config [baseline] (586.448 µs) : 0, 586
Remote Config [candidate] (591.283 µs) : 0, 591
Telemetry [baseline] (6.649 ms) : 0, 6649
Telemetry [candidate] (8.896 ms) : 0, 8896
Loading

Load

Parameters

Baseline Candidate
baseline_or_candidate baseline candidate
end_time 2024-04-16T19:39:09 2024-04-16T20:01:07
git_branch master vzakharov/post-processing
git_commit_date 1713289990 1713294465
git_commit_sha 90a6e41 ae1ecea
release_version 1.33.0-SNAPSHOT~90a6e4126b 1.33.0-SNAPSHOT~ae1ecea607
start_time 2024-04-16T19:38:56 2024-04-16T20:00:54
See matching parameters
Baseline Candidate
application insecure-bank insecure-bank
ci_job_date 1713298013 1713298013
ci_job_id 488444647 488444647
ci_pipeline_id 32245015 32245015
cpu_model Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
variant iast iast

Summary

Found 0 performance improvements and 1 performance regressions! Performance is the same for 10 metrics, 17 unstable metrics.

scenario Δ mean http_req_duration Δ mean throughput candidate mean http_req_duration candidate mean throughput baseline mean http_req_duration baseline mean throughput
scenario:load:petclinic:profiling worse
[+36.076µs; +90.212µs] or [+2.420%; +6.052%]
unstable
[-552.081op/s; +552.081op/s] or [-17.943%; +17.943%]
1.554ms 3076.923op/s 1.491ms 3076.923op/s
Request duration reports for petclinic
gantt
    title petclinic - request duration [CI 0.99] : candidate=1.33.0-SNAPSHOT~ae1ecea607, baseline=1.33.0-SNAPSHOT~90a6e4126b
    dateFormat X
    axisFormat %s
section baseline
no_agent (1.326 ms) : 1307, 1345
.   : milestone, 1326,
appsec (1.717 ms) : 1692, 1742
.   : milestone, 1717,
appsec_no_iast (1.73 ms) : 1706, 1754
.   : milestone, 1730,
iast (1.49 ms) : 1467, 1513
.   : milestone, 1490,
profiling (1.491 ms) : 1466, 1515
.   : milestone, 1491,
tracing (1.455 ms) : 1430, 1479
.   : milestone, 1455,
section candidate
no_agent (1.344 ms) : 1324, 1363
.   : milestone, 1344,
appsec (1.684 ms) : 1659, 1709
.   : milestone, 1684,
appsec_no_iast (1.732 ms) : 1709, 1755
.   : milestone, 1732,
iast (1.498 ms) : 1476, 1520
.   : milestone, 1498,
profiling (1.554 ms) : 1528, 1580
.   : milestone, 1554,
tracing (1.506 ms) : 1481, 1532
.   : milestone, 1506,
Loading
  • baseline results
Variant Request duration [CI 0.99] Δ no_agent
no_agent 1.326 ms [1.307 ms, 1.345 ms] -
appsec 1.717 ms [1.692 ms, 1.742 ms] 391.342 µs (29.5%)
appsec_no_iast 1.73 ms [1.706 ms, 1.754 ms] 403.577 µs (30.4%)
iast 1.49 ms [1.467 ms, 1.513 ms] 164.032 µs (12.4%)
profiling 1.491 ms [1.466 ms, 1.515 ms] 164.45 µs (12.4%)
tracing 1.455 ms [1.43 ms, 1.479 ms] 128.759 µs (9.7%)
  • candidate results
Variant Request duration [CI 0.99] Δ no_agent
no_agent 1.344 ms [1.324 ms, 1.363 ms] -
appsec 1.684 ms [1.659 ms, 1.709 ms] 340.408 µs (25.3%)
appsec_no_iast 1.732 ms [1.709 ms, 1.755 ms] 388.355 µs (28.9%)
iast 1.498 ms [1.476 ms, 1.52 ms] 154.198 µs (11.5%)
profiling 1.554 ms [1.528 ms, 1.58 ms] 210.052 µs (15.6%)
tracing 1.506 ms [1.481 ms, 1.532 ms] 162.68 µs (12.1%)
Request duration reports for insecure-bank
gantt
    title insecure-bank - request duration [CI 0.99] : candidate=1.33.0-SNAPSHOT~ae1ecea607, baseline=1.33.0-SNAPSHOT~90a6e4126b
    dateFormat X
    axisFormat %s
section baseline
no_agent (359.641 µs) : 340, 379
.   : milestone, 360,
iast (476.093 µs) : 455, 497
.   : milestone, 476,
iast_FULL (535.183 µs) : 514, 557
.   : milestone, 535,
iast_GLOBAL (500.211 µs) : 478, 522
.   : milestone, 500,
iast_HARDCODED_SECRET_DISABLED (472.058 µs) : 451, 493
.   : milestone, 472,
iast_INACTIVE (443.423 µs) : 423, 464
.   : milestone, 443,
iast_TELEMETRY_OFF (461.946 µs) : 442, 482
.   : milestone, 462,
tracing (442.266 µs) : 421, 463
.   : milestone, 442,
section candidate
no_agent (364.933 µs) : 345, 385
.   : milestone, 365,
iast (473.102 µs) : 451, 495
.   : milestone, 473,
iast_FULL (533.563 µs) : 513, 554
.   : milestone, 534,
iast_GLOBAL (492.803 µs) : 471, 514
.   : milestone, 493,
iast_HARDCODED_SECRET_DISABLED (467.422 µs) : 447, 488
.   : milestone, 467,
iast_INACTIVE (444.682 µs) : 423, 466
.   : milestone, 445,
iast_TELEMETRY_OFF (464.09 µs) : 443, 485
.   : milestone, 464,
tracing (440.667 µs) : 420, 462
.   : milestone, 441,
Loading
  • baseline results
Variant Request duration [CI 0.99] Δ no_agent
no_agent 359.641 µs [339.939 µs, 379.342 µs] -
iast 476.093 µs [455.289 µs, 496.896 µs] 116.452 µs (32.4%)
iast_FULL 535.183 µs [513.697 µs, 556.669 µs] 175.542 µs (48.8%)
iast_GLOBAL 500.211 µs [478.247 µs, 522.175 µs] 140.57 µs (39.1%)
iast_HARDCODED_SECRET_DISABLED 472.058 µs [451.084 µs, 493.031 µs] 112.417 µs (31.3%)
iast_INACTIVE 443.423 µs [422.633 µs, 464.212 µs] 83.782 µs (23.3%)
iast_TELEMETRY_OFF 461.946 µs [441.621 µs, 482.271 µs] 102.305 µs (28.4%)
tracing 442.266 µs [421.356 µs, 463.175 µs] 82.625 µs (23.0%)
  • candidate results
Variant Request duration [CI 0.99] Δ no_agent
no_agent 364.933 µs [344.677 µs, 385.189 µs] -
iast 473.102 µs [451.245 µs, 494.96 µs] 108.17 µs (29.6%)
iast_FULL 533.563 µs [512.647 µs, 554.479 µs] 168.63 µs (46.2%)
iast_GLOBAL 492.803 µs [471.257 µs, 514.35 µs] 127.871 µs (35.0%)
iast_HARDCODED_SECRET_DISABLED 467.422 µs [446.534 µs, 488.311 µs] 102.49 µs (28.1%)
iast_INACTIVE 444.682 µs [423.485 µs, 465.88 µs] 79.75 µs (21.9%)
iast_TELEMETRY_OFF 464.09 µs [443.345 µs, 484.834 µs] 99.157 µs (27.2%)
tracing 440.667 µs [419.818 µs, 461.516 µs] 75.734 µs (20.8%)

Dacapo

@ValentinZakharov ValentinZakharov force-pushed the vzakharov/post-processing branch 2 times, most recently from 90ff9a9 to 966275f Compare March 12, 2024 17:52
@ValentinZakharov ValentinZakharov added the comp: core Tracer core label Mar 12, 2024
@ValentinZakharov ValentinZakharov force-pushed the vzakharov/post-processing branch from 966275f to 3b692dd Compare March 13, 2024 12:28
@ValentinZakharov ValentinZakharov marked this pull request as ready for review March 13, 2024 13:33
@ValentinZakharov ValentinZakharov requested a review from a team as a code owner March 13, 2024 13:33
Copy link
Contributor

@dougqh dougqh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a few questions about the overall approach.

*
* @param postProcessing flag to indicate the need for post-processing
*/
void setRequiresPostProcessing(boolean postProcessing);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a rather low-level detail that I was hoping we wouldn't need to expose.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea was, to set (in some way) flag in context, that can be indicative of post-processing requirement. Since TraceSegment is implemented in DDSpanContext, for me it look like a perfect candidate, that can be used to set requiresPostProcessing flag from any request handler code.
But, I suppose, we can easily move it to DDSpanContext to avoid exposure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

setRequiresPostProcessing is fully moved to DDSpanContext

@@ -133,5 +133,9 @@ public final class TracerConfig {

public static final String TRACE_FLUSH_INTERVAL = "trace.flush.interval";

public static final String TRACE_POST_PROCESSING_ENABLED = "trace.post-processing.enabled";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like it will cross-cut across products. Do want the user to be able to control this directly?

I think it might be better for there to knobs that control the individual pieces of post-processing.
Otherwise, we might disable this for ASM only to adversely impact APM as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I was imaging in future, we can have various independent post-processing tasks (each called by dd-post-processor thread one-by-one). They could be represented as list:

class TraceProcessingWorker {
     List<TracePostProcessor> tracePostProcessors
     ...
}

And for each TracePostProcessor will have to have it's own way to disable it (via config).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed option to disable post-processing. Instead will be used disabling for each post-processor

return;
}

Thread thread =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to split off a separate thread here?
Given the join further down, it looks this is basically still operating synchronously.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC the intent was to limit the post-processing time, i.e. allow cancellation of post-processing when the limit was exceeded.

If the post-processing can be represented as a loop, where each iteration is small enough, then this could be achieved without using a separate thread - by checking whether the limit has been exceeded on each iteration and breaking out of the loop early.

Copy link
Contributor

@mcculls mcculls Mar 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code as it is assumes that the post-processing work can be safely interrupted...

Now that I see the overall PR I think a better approach would be for the post-processor to accept a BooleanSupplier (or Predicate if there's a context object shared between the worker and post-processor)

The post-processor could then call that at the start of each loop to decide whether it needs to abort or not.

Then all the worker would need to do is provide an implementation of that BooleanSupplier which changes the result when the time has been exceeded. Another benefit of this approach is that testing could provide a different implementation, which uses a different condition.

In other words - rather than have a separate thread, and assume the post-processing is interrupt-safe, we invert the check and have the post-processor check the condition supplied by the worker whether it should continue or abort. That would avoid the need for extra threads/interrupts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I confirm, my intention to use a thread was due to the need to interrupt it with a timeout. I think, post-processing duration could be related to amount of data (imagine request with very huge body).

@mcculls, good point! I'll try to get rid of creating new threads. The only concern here - in my particular case, I will need to call WAF (native module) during the post-processing, and there will be no way to interrupt native code execution (apart of built-in WAF timeout mechanism)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed thread creating and introduced timeoutCheck as discussed

@ValentinZakharov ValentinZakharov force-pushed the vzakharov/post-processing branch from e3db013 to fd8abcd Compare March 26, 2024 11:58
@@ -16,6 +16,7 @@ public enum AgentThread {
TRACE_STARTUP("dd-agent-startup-datadog-tracer"),
TRACE_MONITOR("dd-trace-monitor"),
TRACE_PROCESSOR("dd-trace-processor"),
TRACE_POST_PROCESSOR("dd-trace-post-processor"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is no longer needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, missed from pervious implementation, removed

@@ -140,6 +140,7 @@ public class DDSpanContext
private final boolean injectBaggageAsTags;
private volatile int encodedOperationName;
private volatile int encodedResourceName;
private boolean requiresPostProcessing;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this need to be volatile?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, my bad

BooleanSupplier timeoutCheck = () -> System.currentTimeMillis() > deadline;

for (DDSpan span : trace) {
if (timeoutCheck.getAsBoolean()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will incur a cost of calling System.currentTimeMillis() for every span in the trace, even if no spans need post-processing. How about doing an initial loop to collect any spans that need post-processing and only setup the timeout check, etc. when we know we have spans to post-process?

You could also lazily create the collection to hold the spans to post-process, ie. only create it when you find a span to post-process, in which case for traces with no spans to post-process the only cost would be that first iteration to check the flag (ie. no allocation or time calls).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reworked as suggested

}

if (context.isRequiresPostProcessing()) {
spanPostProcessor.process(span, context, timeoutCheck);
Copy link
Contributor

@mcculls mcculls Mar 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The process method returns a boolean but its value isn't checked here - is the plan to use it to abort the post-processing or should the process method be changed to return void?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I've missed it. Changed the logic to check return value

Copy link
Contributor

@mcculls mcculls left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybeTracePostProcessing should be changed to do an initial scan of the spans and collect those requiring post-processing. If no spans need post-processing it can exit early. Otherwise setup the timeout check, etc. and start post-processing the smaller subset of spans that need it.

Also the description mentions a feature-flag dd.trace.post-processing.enabled to control this feature, but that doesn't appear to be implemented yet?


for (DDSpan span : spansToPostProcess) {
if (timeoutCheck.getAsBoolean() || !spanPostProcessor.process(span, timeoutCheck)) {
log.debug("Span post-processing is interrupted due to timeout.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
log.debug("Span post-processing is interrupted due to timeout.");
log.debug("Span post-processing interrupted due to timeout.");

@ValentinZakharov ValentinZakharov force-pushed the vzakharov/post-processing branch from af1b0cd to 361b7ba Compare March 27, 2024 15:27
Copy link
Contributor

@mcculls mcculls left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm OK with the updated implementation - just a question about whether we should leave the no-op AppSecPostProcessor in place before merging

@@ -225,6 +231,7 @@ public TraceSerializingHandler(
} else {
this.ticksRequiredToFlush = Long.MAX_VALUE;
}
this.spanPostProcessor = new AppSecPostProcessor();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since AppSecPostProcessor is currently a no-op, how about leaving spanPostProcessor as null for the initial merge? You could then check spanPostProcessor != null before calling maybeTracePostProcessing (or check it inside that method) so the post-processing only happens when there's a registered post-processor...

Copy link
Contributor Author

@ValentinZakharov ValentinZakharov Mar 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That could be reasonable, but in this case, test coverage will not pass required minimum.

Copy link
Contributor

@dougqh dougqh Mar 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want to write our code to satisfy the coverage tool. The coverage tool is suppose to help make code better not worse.

I have to believe there's a better solution.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two options:

  1. expose a way to set a test post-processor, used to test out the implementation
    • this would also help you test edge conditions as well as do micro-benchmarking
  2. exclude the class from the code-coverage requirements
    • this is not ideal, but for some situations it is necessary

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done + added test

@ValentinZakharov ValentinZakharov force-pushed the vzakharov/post-processing branch from 4f4cfe9 to 361b7ba Compare March 27, 2024 18:30
@@ -2036,6 +2038,8 @@ PROFILING_DATADOG_PROFILER_ENABLED, isDatadogProfilerSafeInCurrentEnvironment())
"Agentless profiling activated but no api key provided. Profile uploading will likely fail");
}

this.tracePostProcessingTimeout = configProvider.getLong(TRACE_POST_PROCESSING_TIMEOUT, 1000);
Copy link
Contributor

@dougqh dougqh Mar 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to see the default as a static variable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added default value

@@ -945,4 +946,12 @@ private String getTagName(String key) {
// TODO is this decided?
return "_dd." + key + ".json";
}

public void setRequiresPostProcessing(boolean postProcessing) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't particularly like that these methods are public.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my case, setRequiresPostProcessing, should be called from appsec module, so we need to expose it in some way. Either in TraceSegmet, either in DDSpanContext.
Do you have any other options in mind?

* @return {@code true} if the span was successfully processed; {@code false} in case of a
* timeout.
*/
boolean process(DDSpan span, BooleanSupplier timeoutCheck);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the BooleanSupplier? Would it not be better to just have two different methods to cover the two different cases?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea of the supplier was for situations where the processor was doing something non-trivial - assuming that processing could be broken down into smaller chunks then the supplier would allow it to check to see if the timeout had been exceeded mid-processing (otherwise there's no easy way for a processor to check to see if it should abort its processing)

In other words during the process call the value returned by the supplier could change, and it's not possible to represent that as two methods.

This is only really needed when the process call might take a while and need to be cancelled mid-call.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that makes sense. Thanks for the clarification.

@ValentinZakharov ValentinZakharov force-pushed the vzakharov/post-processing branch from 08626e3 to 0338f34 Compare April 12, 2024 16:06
@ValentinZakharov ValentinZakharov force-pushed the vzakharov/post-processing branch from 0338f34 to 749f577 Compare April 15, 2024 16:24
@ValentinZakharov ValentinZakharov requested a review from dougqh April 16, 2024 20:55
@ValentinZakharov ValentinZakharov merged commit 632ec88 into master Apr 16, 2024
@ValentinZakharov ValentinZakharov deleted the vzakharov/post-processing branch April 16, 2024 21:25
@github-actions github-actions bot added this to the 1.33.0 milestone Apr 16, 2024
@PerfectSlayer PerfectSlayer added the tag: experimental Experimental changes label Apr 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
comp: asm waf Application Security Management (WAF) comp: core Tracer core tag: experimental Experimental changes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants