Introduced trace post-processing #6800

ValentinZakharov · 2024-03-12T00:03:59Z

What Does This Do

Introduced the Trace Post-Processing - experimental feature.
It allows to process trace at very late stage (in dd-trace-processor thread), before serialization and dispatching to Agent.
The trace processor, check if there is spans that requires post-processing and perform SpanPostProcessor for each of them with timeout. Timeout used for case, when post-processing takes too long and need to be dropped.
The post-processing happens only for span context marked as requiresPostProcessing.

Also added extra configuration options:

trace.post-processing.timeout - to set timeout for post-processing thread (default: 1000ms)

Motivation

The idea is to have mechanism that can safely do extra computations without bringing overhead in requests latency.
This is preparation for new API Security sampling, where heavy schema extraction should happen after the request finished when we have information about span sampling priority. We intent to compute schema only if span will be retained.

Additional Notes

pr-commenter · 2024-03-12T00:50:40Z

Benchmarks

Startup

Parameters

	Baseline	Candidate
baseline_or_candidate	baseline	candidate
git_branch	master	vzakharov/post-processing
git_commit_date	1713289990	1713294465
git_commit_sha	`90a6e41`	`ae1ecea`
release_version	1.33.0-SNAPSHOT~90a6e4126b	1.33.0-SNAPSHOT~ae1ecea607

See matching parameters

	Baseline	Candidate
application	insecure-bank	insecure-bank
ci_job_date	1713298013	1713298013
ci_job_id	488444647	488444647
ci_pipeline_id	32245015	32245015
cpu_model	Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz	Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
module	Agent	Agent
parent	None	None
variant	iast	iast

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 49 metrics, 14 unstable metrics.

Startup time reports for petclinic

gantt
    title petclinic - global startup overhead: candidate=1.33.0-SNAPSHOT~ae1ecea607, baseline=1.33.0-SNAPSHOT~90a6e4126b

    dateFormat X
    axisFormat %s
section tracing
Agent [baseline] (1.076 s) : 0, 1075677
Total [baseline] (10.386 s) : 0, 10385832
Agent [candidate] (1.082 s) : 0, 1081685
Total [candidate] (10.41 s) : 0, 10409678
section appsec
Agent [baseline] (1.205 s) : 0, 1204729
Total [baseline] (10.511 s) : 0, 10511249
Agent [candidate] (1.205 s) : 0, 1204655
Total [candidate] (10.435 s) : 0, 10435305
section iast
Agent [baseline] (1.199 s) : 0, 1198556
Total [baseline] (10.707 s) : 0, 10707283
Agent [candidate] (1.198 s) : 0, 1197673
Total [candidate] (10.837 s) : 0, 10837312
section profiling
Agent [baseline] (1.266 s) : 0, 1265901
Total [baseline] (10.624 s) : 0, 10624017
Agent [candidate] (1.268 s) : 0, 1268075
Total [candidate] (10.656 s) : 0, 10655848

baseline results

Module	Variant	Duration	Δ tracing
Agent	tracing	1.076 s	-
Agent	appsec	1.205 s	129.052 ms (12.0%)
Agent	iast	1.199 s	122.879 ms (11.4%)
Agent	profiling	1.266 s	190.224 ms (17.7%)
Total	tracing	10.386 s	-
Total	appsec	10.511 s	125.418 ms (1.2%)
Total	iast	10.707 s	321.451 ms (3.1%)
Total	profiling	10.624 s	238.185 ms (2.3%)

candidate results

Module	Variant	Duration	Δ tracing
Agent	tracing	1.082 s	-
Agent	appsec	1.205 s	122.97 ms (11.4%)
Agent	iast	1.198 s	115.988 ms (10.7%)
Agent	profiling	1.268 s	186.39 ms (17.2%)
Total	tracing	10.41 s	-
Total	appsec	10.435 s	25.628 ms (0.2%)
Total	iast	10.837 s	427.634 ms (4.1%)
Total	profiling	10.656 s	246.17 ms (2.4%)

gantt
    title petclinic - break down per module: candidate=1.33.0-SNAPSHOT~ae1ecea607, baseline=1.33.0-SNAPSHOT~90a6e4126b

    dateFormat X
    axisFormat %s
section tracing
BytebuddyAgent [baseline] (673.774 ms) : 0, 673774
BytebuddyAgent [candidate] (678.246 ms) : 0, 678246
GlobalTracer [baseline] (309.564 ms) : 0, 309564
GlobalTracer [candidate] (311.037 ms) : 0, 311037
AppSec [baseline] (49.642 ms) : 0, 49642
AppSec [candidate] (49.58 ms) : 0, 49580
Remote Config [baseline] (666.389 µs) : 0, 666
Remote Config [candidate] (655.047 µs) : 0, 655
Telemetry [baseline] (7.614 ms) : 0, 7614
Telemetry [candidate] (7.534 ms) : 0, 7534
section appsec
BytebuddyAgent [baseline] (699.616 ms) : 0, 699616
BytebuddyAgent [candidate] (699.182 ms) : 0, 699182
GlobalTracer [baseline] (292.955 ms) : 0, 292955
GlobalTracer [candidate] (292.925 ms) : 0, 292925
AppSec [baseline] (150.216 ms) : 0, 150216
AppSec [candidate] (149.764 ms) : 0, 149764
IAST [baseline] (19.177 ms) : 0, 19177
IAST [candidate] (18.868 ms) : 0, 18868
Remote Config [baseline] (611.363 µs) : 0, 611
Remote Config [candidate] (610.55 µs) : 0, 611
Telemetry [baseline] (7.489 ms) : 0, 7489
Telemetry [candidate] (8.735 ms) : 0, 8735
section iast
BytebuddyAgent [baseline] (794.453 ms) : 0, 794453
BytebuddyAgent [candidate] (793.569 ms) : 0, 793569
GlobalTracer [baseline] (287.88 ms) : 0, 287880
GlobalTracer [candidate] (288.583 ms) : 0, 288583
AppSec [baseline] (51.607 ms) : 0, 51607
AppSec [candidate] (49.523 ms) : 0, 49523
IAST [baseline] (22.441 ms) : 0, 22441
IAST [candidate] (22.822 ms) : 0, 22822
Remote Config [baseline] (561.653 µs) : 0, 562
Remote Config [candidate] (591.211 µs) : 0, 591
Telemetry [baseline] (7.32 ms) : 0, 7320
Telemetry [candidate] (8.234 ms) : 0, 8234
section profiling
BytebuddyAgent [baseline] (676.142 ms) : 0, 676142
BytebuddyAgent [candidate] (676.922 ms) : 0, 676922
GlobalTracer [baseline] (379.506 ms) : 0, 379506
GlobalTracer [candidate] (381.176 ms) : 0, 381176
AppSec [baseline] (50.116 ms) : 0, 50116
AppSec [candidate] (50.104 ms) : 0, 50104
Remote Config [baseline] (723.43 µs) : 0, 723
Remote Config [candidate] (706.439 µs) : 0, 706
Telemetry [baseline] (7.427 ms) : 0, 7427
Telemetry [candidate] (7.483 ms) : 0, 7483
ProfilingAgent [baseline] (95.706 ms) : 0, 95706
ProfilingAgent [candidate] (95.438 ms) : 0, 95438
Profiling [baseline] (95.73 ms) : 0, 95730
Profiling [candidate] (95.462 ms) : 0, 95462

Startup time reports for insecure-bank

gantt
    title insecure-bank - global startup overhead: candidate=1.33.0-SNAPSHOT~ae1ecea607, baseline=1.33.0-SNAPSHOT~90a6e4126b

    dateFormat X
    axisFormat %s
section tracing
Agent [baseline] (1.081 s) : 0, 1081499
Total [baseline] (8.559 s) : 0, 8558976
Agent [candidate] (1.099 s) : 0, 1099098
Total [candidate] (8.639 s) : 0, 8639062
section iast
Agent [baseline] (1.196 s) : 0, 1196121
Total [baseline] (9.001 s) : 0, 9000551
Agent [candidate] (1.2 s) : 0, 1199857
Total [candidate] (9.005 s) : 0, 9005325
section iast_HARDCODED_SECRET_DISABLED
Agent [baseline] (1.206 s) : 0, 1206467
Total [baseline] (8.978 s) : 0, 8977733
Agent [candidate] (1.2 s) : 0, 1199716
Total [candidate] (8.98 s) : 0, 8980289
section iast_TELEMETRY_OFF
Agent [baseline] (1.196 s) : 0, 1196195
Total [baseline] (9.012 s) : 0, 9011770
Agent [candidate] (1.207 s) : 0, 1206999
Total [candidate] (9.067 s) : 0, 9067034

baseline results

Module	Variant	Duration	Δ tracing
Agent	tracing	1.081 s	-
Agent	iast	1.196 s	114.623 ms (10.6%)
Agent	iast_HARDCODED_SECRET_DISABLED	1.206 s	124.969 ms (11.6%)
Agent	iast_TELEMETRY_OFF	1.196 s	114.697 ms (10.6%)
Total	tracing	8.559 s	-
Total	iast	9.001 s	441.575 ms (5.2%)
Total	iast_HARDCODED_SECRET_DISABLED	8.978 s	418.757 ms (4.9%)
Total	iast_TELEMETRY_OFF	9.012 s	452.794 ms (5.3%)

candidate results

Module	Variant	Duration	Δ tracing
Agent	tracing	1.099 s	-
Agent	iast	1.2 s	100.759 ms (9.2%)
Agent	iast_HARDCODED_SECRET_DISABLED	1.2 s	100.618 ms (9.2%)
Agent	iast_TELEMETRY_OFF	1.207 s	107.901 ms (9.8%)
Total	tracing	8.639 s	-
Total	iast	9.005 s	366.263 ms (4.2%)
Total	iast_HARDCODED_SECRET_DISABLED	8.98 s	341.227 ms (3.9%)
Total	iast_TELEMETRY_OFF	9.067 s	427.972 ms (5.0%)

gantt
    title insecure-bank - break down per module: candidate=1.33.0-SNAPSHOT~ae1ecea607, baseline=1.33.0-SNAPSHOT~90a6e4126b

    dateFormat X
    axisFormat %s
section tracing
BytebuddyAgent [baseline] (677.732 ms) : 0, 677732
BytebuddyAgent [candidate] (688.508 ms) : 0, 688508
GlobalTracer [baseline] (311.152 ms) : 0, 311152
GlobalTracer [candidate] (316.532 ms) : 0, 316532
AppSec [baseline] (49.781 ms) : 0, 49781
AppSec [candidate] (50.484 ms) : 0, 50484
Remote Config [baseline] (663.623 µs) : 0, 664
Remote Config [candidate] (681.641 µs) : 0, 682
Telemetry [baseline] (7.578 ms) : 0, 7578
Telemetry [candidate] (7.723 ms) : 0, 7723
section iast
BytebuddyAgent [baseline] (792.546 ms) : 0, 792546
BytebuddyAgent [candidate] (794.825 ms) : 0, 794825
GlobalTracer [baseline] (287.101 ms) : 0, 287101
GlobalTracer [candidate] (288.854 ms) : 0, 288854
AppSec [baseline] (48.84 ms) : 0, 48840
AppSec [candidate] (50.179 ms) : 0, 50179
IAST [baseline] (23.916 ms) : 0, 23916
IAST [candidate] (22.138 ms) : 0, 22138
Remote Config [baseline] (569.674 µs) : 0, 570
Remote Config [candidate] (574.756 µs) : 0, 575
Telemetry [baseline] (8.888 ms) : 0, 8888
Telemetry [candidate] (8.924 ms) : 0, 8924
section iast_HARDCODED_SECRET_DISABLED
BytebuddyAgent [baseline] (800.093 ms) : 0, 800093
BytebuddyAgent [candidate] (795.162 ms) : 0, 795162
GlobalTracer [baseline] (289.549 ms) : 0, 289549
GlobalTracer [candidate] (288.548 ms) : 0, 288548
AppSec [baseline] (51.472 ms) : 0, 51472
AppSec [candidate] (50.821 ms) : 0, 50821
IAST [baseline] (22.875 ms) : 0, 22875
IAST [candidate] (21.997 ms) : 0, 21997
Remote Config [baseline] (579.551 µs) : 0, 580
Remote Config [candidate] (612.189 µs) : 0, 612
Telemetry [baseline] (7.331 ms) : 0, 7331
Telemetry [candidate] (8.171 ms) : 0, 8171
section iast_TELEMETRY_OFF
BytebuddyAgent [baseline] (792.158 ms) : 0, 792158
BytebuddyAgent [candidate] (799.567 ms) : 0, 799567
GlobalTracer [baseline] (287.819 ms) : 0, 287819
GlobalTracer [candidate] (291.483 ms) : 0, 291483
AppSec [baseline] (51.962 ms) : 0, 51962
AppSec [candidate] (49.858 ms) : 0, 49858
IAST [baseline] (22.644 ms) : 0, 22644
IAST [candidate] (21.987 ms) : 0, 21987
Remote Config [baseline] (586.448 µs) : 0, 586
Remote Config [candidate] (591.283 µs) : 0, 591
Telemetry [baseline] (6.649 ms) : 0, 6649
Telemetry [candidate] (8.896 ms) : 0, 8896

Load

Parameters

	Baseline	Candidate
baseline_or_candidate	baseline	candidate
end_time	2024-04-16T19:39:09	2024-04-16T20:01:07
git_branch	master	vzakharov/post-processing
git_commit_date	1713289990	1713294465
git_commit_sha	`90a6e41`	`ae1ecea`
release_version	1.33.0-SNAPSHOT~90a6e4126b	1.33.0-SNAPSHOT~ae1ecea607
start_time	2024-04-16T19:38:56	2024-04-16T20:00:54

See matching parameters

	Baseline	Candidate
application	insecure-bank	insecure-bank
ci_job_date	1713298013	1713298013
ci_job_id	488444647	488444647
ci_pipeline_id	32245015	32245015
cpu_model	Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz	Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
variant	iast	iast

Summary

Found 0 performance improvements and 1 performance regressions! Performance is the same for 10 metrics, 17 unstable metrics.

scenario	Δ mean http_req_duration	Δ mean throughput	candidate mean http_req_duration	candidate mean throughput	baseline mean http_req_duration	baseline mean throughput
scenario:load:petclinic:profiling	worse [+36.076µs; +90.212µs] or [+2.420%; +6.052%]	unstable [-552.081op/s; +552.081op/s] or [-17.943%; +17.943%]	1.554ms	3076.923op/s	1.491ms	3076.923op/s

Request duration reports for petclinic

gantt
    title petclinic - request duration [CI 0.99] : candidate=1.33.0-SNAPSHOT~ae1ecea607, baseline=1.33.0-SNAPSHOT~90a6e4126b
    dateFormat X
    axisFormat %s
section baseline
no_agent (1.326 ms) : 1307, 1345
.   : milestone, 1326,
appsec (1.717 ms) : 1692, 1742
.   : milestone, 1717,
appsec_no_iast (1.73 ms) : 1706, 1754
.   : milestone, 1730,
iast (1.49 ms) : 1467, 1513
.   : milestone, 1490,
profiling (1.491 ms) : 1466, 1515
.   : milestone, 1491,
tracing (1.455 ms) : 1430, 1479
.   : milestone, 1455,
section candidate
no_agent (1.344 ms) : 1324, 1363
.   : milestone, 1344,
appsec (1.684 ms) : 1659, 1709
.   : milestone, 1684,
appsec_no_iast (1.732 ms) : 1709, 1755
.   : milestone, 1732,
iast (1.498 ms) : 1476, 1520
.   : milestone, 1498,
profiling (1.554 ms) : 1528, 1580
.   : milestone, 1554,
tracing (1.506 ms) : 1481, 1532
.   : milestone, 1506,

baseline results

Variant	Request duration [CI 0.99]	Δ no_agent
no_agent	1.326 ms [1.307 ms, 1.345 ms]	-
appsec	1.717 ms [1.692 ms, 1.742 ms]	391.342 µs (29.5%)
appsec_no_iast	1.73 ms [1.706 ms, 1.754 ms]	403.577 µs (30.4%)
iast	1.49 ms [1.467 ms, 1.513 ms]	164.032 µs (12.4%)
profiling	1.491 ms [1.466 ms, 1.515 ms]	164.45 µs (12.4%)
tracing	1.455 ms [1.43 ms, 1.479 ms]	128.759 µs (9.7%)

candidate results

Variant	Request duration [CI 0.99]	Δ no_agent
no_agent	1.344 ms [1.324 ms, 1.363 ms]	-
appsec	1.684 ms [1.659 ms, 1.709 ms]	340.408 µs (25.3%)
appsec_no_iast	1.732 ms [1.709 ms, 1.755 ms]	388.355 µs (28.9%)
iast	1.498 ms [1.476 ms, 1.52 ms]	154.198 µs (11.5%)
profiling	1.554 ms [1.528 ms, 1.58 ms]	210.052 µs (15.6%)
tracing	1.506 ms [1.481 ms, 1.532 ms]	162.68 µs (12.1%)

Request duration reports for insecure-bank

gantt
    title insecure-bank - request duration [CI 0.99] : candidate=1.33.0-SNAPSHOT~ae1ecea607, baseline=1.33.0-SNAPSHOT~90a6e4126b
    dateFormat X
    axisFormat %s
section baseline
no_agent (359.641 µs) : 340, 379
.   : milestone, 360,
iast (476.093 µs) : 455, 497
.   : milestone, 476,
iast_FULL (535.183 µs) : 514, 557
.   : milestone, 535,
iast_GLOBAL (500.211 µs) : 478, 522
.   : milestone, 500,
iast_HARDCODED_SECRET_DISABLED (472.058 µs) : 451, 493
.   : milestone, 472,
iast_INACTIVE (443.423 µs) : 423, 464
.   : milestone, 443,
iast_TELEMETRY_OFF (461.946 µs) : 442, 482
.   : milestone, 462,
tracing (442.266 µs) : 421, 463
.   : milestone, 442,
section candidate
no_agent (364.933 µs) : 345, 385
.   : milestone, 365,
iast (473.102 µs) : 451, 495
.   : milestone, 473,
iast_FULL (533.563 µs) : 513, 554
.   : milestone, 534,
iast_GLOBAL (492.803 µs) : 471, 514
.   : milestone, 493,
iast_HARDCODED_SECRET_DISABLED (467.422 µs) : 447, 488
.   : milestone, 467,
iast_INACTIVE (444.682 µs) : 423, 466
.   : milestone, 445,
iast_TELEMETRY_OFF (464.09 µs) : 443, 485
.   : milestone, 464,
tracing (440.667 µs) : 420, 462
.   : milestone, 441,

baseline results

Variant	Request duration [CI 0.99]	Δ no_agent
no_agent	359.641 µs [339.939 µs, 379.342 µs]	-
iast	476.093 µs [455.289 µs, 496.896 µs]	116.452 µs (32.4%)
iast_FULL	535.183 µs [513.697 µs, 556.669 µs]	175.542 µs (48.8%)
iast_GLOBAL	500.211 µs [478.247 µs, 522.175 µs]	140.57 µs (39.1%)
iast_HARDCODED_SECRET_DISABLED	472.058 µs [451.084 µs, 493.031 µs]	112.417 µs (31.3%)
iast_INACTIVE	443.423 µs [422.633 µs, 464.212 µs]	83.782 µs (23.3%)
iast_TELEMETRY_OFF	461.946 µs [441.621 µs, 482.271 µs]	102.305 µs (28.4%)
tracing	442.266 µs [421.356 µs, 463.175 µs]	82.625 µs (23.0%)

candidate results

Variant	Request duration [CI 0.99]	Δ no_agent
no_agent	364.933 µs [344.677 µs, 385.189 µs]	-
iast	473.102 µs [451.245 µs, 494.96 µs]	108.17 µs (29.6%)
iast_FULL	533.563 µs [512.647 µs, 554.479 µs]	168.63 µs (46.2%)
iast_GLOBAL	492.803 µs [471.257 µs, 514.35 µs]	127.871 µs (35.0%)
iast_HARDCODED_SECRET_DISABLED	467.422 µs [446.534 µs, 488.311 µs]	102.49 µs (28.1%)
iast_INACTIVE	444.682 µs [423.485 µs, 465.88 µs]	79.75 µs (21.9%)
iast_TELEMETRY_OFF	464.09 µs [443.345 µs, 484.834 µs]	99.157 µs (27.2%)
tracing	440.667 µs [419.818 µs, 461.516 µs]	75.734 µs (20.8%)

Dacapo

dougqh

I have a few questions about the overall approach.

dougqh · 2024-03-13T19:56:28Z

dd-trace-api/src/main/java/datadog/trace/api/internal/TraceSegment.java

+   *
+   * @param postProcessing flag to indicate the need for post-processing
+   */
+  void setRequiresPostProcessing(boolean postProcessing);


This seems like a rather low-level detail that I was hoping we wouldn't need to expose.

The idea was, to set (in some way) flag in context, that can be indicative of post-processing requirement. Since TraceSegment is implemented in DDSpanContext, for me it look like a perfect candidate, that can be used to set requiresPostProcessing flag from any request handler code.
But, I suppose, we can easily move it to DDSpanContext to avoid exposure.

setRequiresPostProcessing is fully moved to DDSpanContext

dougqh · 2024-03-13T19:58:56Z

dd-trace-api/src/main/java/datadog/trace/api/config/TracerConfig.java

@@ -133,5 +133,9 @@ public final class TracerConfig {

  public static final String TRACE_FLUSH_INTERVAL = "trace.flush.interval";

+  public static final String TRACE_POST_PROCESSING_ENABLED = "trace.post-processing.enabled";


This seems like it will cross-cut across products. Do want the user to be able to control this directly?

I think it might be better for there to knobs that control the individual pieces of post-processing.
Otherwise, we might disable this for ASM only to adversely impact APM as well.

Good point. I was imaging in future, we can have various independent post-processing tasks (each called by dd-post-processor thread one-by-one). They could be represented as list:

class TraceProcessingWorker { List<TracePostProcessor> tracePostProcessors ... }

And for each TracePostProcessor will have to have it's own way to disable it (via config).

Removed option to disable post-processing. Instead will be used disabling for each post-processor

dougqh · 2024-03-13T20:02:31Z

dd-trace-core/src/main/java/datadog/trace/common/writer/TraceProcessingWorker.java

+        return;
+      }
+
+      Thread thread =


Why do we need to split off a separate thread here?
Given the join further down, it looks this is basically still operating synchronously.

IIRC the intent was to limit the post-processing time, i.e. allow cancellation of post-processing when the limit was exceeded.

If the post-processing can be represented as a loop, where each iteration is small enough, then this could be achieved without using a separate thread - by checking whether the limit has been exceeded on each iteration and breaking out of the loop early.

The code as it is assumes that the post-processing work can be safely interrupted...

Now that I see the overall PR I think a better approach would be for the post-processor to accept a BooleanSupplier (or Predicate if there's a context object shared between the worker and post-processor)

The post-processor could then call that at the start of each loop to decide whether it needs to abort or not.

Then all the worker would need to do is provide an implementation of that BooleanSupplier which changes the result when the time has been exceeded. Another benefit of this approach is that testing could provide a different implementation, which uses a different condition.

In other words - rather than have a separate thread, and assume the post-processing is interrupt-safe, we invert the check and have the post-processor check the condition supplied by the worker whether it should continue or abort. That would avoid the need for extra threads/interrupts.

I confirm, my intention to use a thread was due to the need to interrupt it with a timeout. I think, post-processing duration could be related to amount of data (imagine request with very huge body).

@mcculls, good point! I'll try to get rid of creating new threads. The only concern here - in my particular case, I will need to call WAF (native module) during the post-processing, and there will be no way to interrupt native code execution (apart of built-in WAF timeout mechanism)

Removed thread creating and introduced timeoutCheck as discussed

mcculls · 2024-03-27T11:03:49Z

internal-api/src/main/java/datadog/trace/util/AgentThreadFactory.java

@@ -16,6 +16,7 @@ public enum AgentThread {
    TRACE_STARTUP("dd-agent-startup-datadog-tracer"),
    TRACE_MONITOR("dd-trace-monitor"),
    TRACE_PROCESSOR("dd-trace-processor"),
+    TRACE_POST_PROCESSOR("dd-trace-post-processor"),


This is no longer needed?

Thanks, missed from pervious implementation, removed

dd-trace-core/src/main/java/datadog/trace/core/postprocessor/SpanPostProcessor.java

mcculls · 2024-03-27T11:06:50Z

dd-trace-core/src/main/java/datadog/trace/core/DDSpanContext.java

@@ -140,6 +140,7 @@ public class DDSpanContext
  private final boolean injectBaggageAsTags;
  private volatile int encodedOperationName;
  private volatile int encodedResourceName;
+  private boolean requiresPostProcessing;


does this need to be volatile?

Right, my bad

mcculls · 2024-03-27T11:15:54Z

dd-trace-core/src/main/java/datadog/trace/common/writer/TraceProcessingWorker.java

+      BooleanSupplier timeoutCheck = () -> System.currentTimeMillis() > deadline;
+
+      for (DDSpan span : trace) {
+        if (timeoutCheck.getAsBoolean()) {


This will incur a cost of calling System.currentTimeMillis() for every span in the trace, even if no spans need post-processing. How about doing an initial loop to collect any spans that need post-processing and only setup the timeout check, etc. when we know we have spans to post-process?

You could also lazily create the collection to hold the spans to post-process, ie. only create it when you find a span to post-process, in which case for traces with no spans to post-process the only cost would be that first iteration to check the flag (ie. no allocation or time calls).

Reworked as suggested

dd-trace-core/src/main/java/datadog/trace/core/postprocessor/SpanPostProcessor.java

mcculls · 2024-03-27T11:20:58Z

dd-trace-core/src/main/java/datadog/trace/common/writer/TraceProcessingWorker.java

+        }
+
+        if (context.isRequiresPostProcessing()) {
+          spanPostProcessor.process(span, context, timeoutCheck);


The process method returns a boolean but its value isn't checked here - is the plan to use it to abort the post-processing or should the process method be changed to return void?

Yeah, I've missed it. Changed the logic to check return value

mcculls

maybeTracePostProcessing should be changed to do an initial scan of the spans and collect those requiring post-processing. If no spans need post-processing it can exit early. Otherwise setup the timeout check, etc. and start post-processing the smaller subset of spans that need it.

Also the description mentions a feature-flag dd.trace.post-processing.enabled to control this feature, but that doesn't appear to be implemented yet?

mcculls · 2024-03-27T15:27:14Z

dd-trace-core/src/main/java/datadog/trace/common/writer/TraceProcessingWorker.java

+
+      for (DDSpan span : spansToPostProcess) {
+        if (timeoutCheck.getAsBoolean() || !spanPostProcessor.process(span, timeoutCheck)) {
+          log.debug("Span post-processing is interrupted due to timeout.");


Suggested change

log.debug("Span post-processing is interrupted due to timeout.");

log.debug("Span post-processing interrupted due to timeout.");

mcculls

I'm OK with the updated implementation - just a question about whether we should leave the no-op AppSecPostProcessor in place before merging

mcculls · 2024-03-27T15:36:00Z

dd-trace-core/src/main/java/datadog/trace/common/writer/TraceProcessingWorker.java

@@ -225,6 +231,7 @@ public TraceSerializingHandler(
      } else {
        this.ticksRequiredToFlush = Long.MAX_VALUE;
      }
+      this.spanPostProcessor = new AppSecPostProcessor();


Since AppSecPostProcessor is currently a no-op, how about leaving spanPostProcessor as null for the initial merge? You could then check spanPostProcessor != null before calling maybeTracePostProcessing (or check it inside that method) so the post-processing only happens when there's a registered post-processor...

That could be reasonable, but in this case, test coverage will not pass required minimum.

I don't want to write our code to satisfy the coverage tool. The coverage tool is suppose to help make code better not worse.

I have to believe there's a better solution.

Two options:

expose a way to set a test post-processor, used to test out the implementation

this would also help you test edge conditions as well as do micro-benchmarking

exclude the class from the code-coverage requirements

this is not ideal, but for some situations it is necessary

Done + added test

dougqh · 2024-03-27T18:42:41Z

internal-api/src/main/java/datadog/trace/api/Config.java

@@ -2036,6 +2038,8 @@ PROFILING_DATADOG_PROFILER_ENABLED, isDatadogProfilerSafeInCurrentEnvironment())
          "Agentless profiling activated but no api key provided. Profile uploading will likely fail");
    }

+    this.tracePostProcessingTimeout = configProvider.getLong(TRACE_POST_PROCESSING_TIMEOUT, 1000);


I'd like to see the default as a static variable.

Added default value

dougqh · 2024-03-27T18:45:37Z

dd-trace-core/src/main/java/datadog/trace/core/DDSpanContext.java

@@ -945,4 +946,12 @@ private String getTagName(String key) {
    // TODO is this decided?
    return "_dd." + key + ".json";
  }
+
+  public void setRequiresPostProcessing(boolean postProcessing) {


I don't particularly like that these methods are public.

In my case, setRequiresPostProcessing, should be called from appsec module, so we need to expose it in some way. Either in TraceSegmet, either in DDSpanContext.
Do you have any other options in mind?

dougqh · 2024-03-27T18:47:23Z

dd-trace-core/src/main/java/datadog/trace/core/postprocessor/SpanPostProcessor.java

+   * @return {@code true} if the span was successfully processed; {@code false} in case of a
+   *     timeout.
+   */
+  boolean process(DDSpan span, BooleanSupplier timeoutCheck);


Why the BooleanSupplier? Would it not be better to just have two different methods to cover the two different cases?

The idea of the supplier was for situations where the processor was doing something non-trivial - assuming that processing could be broken down into smaller chunks then the supplier would allow it to check to see if the timeout had been exceeded mid-processing (otherwise there's no easy way for a processor to check to see if it should abort its processing)

In other words during the process call the value returned by the supplier could change, and it's not possible to represent that as two methods.

This is only really needed when the process call might take a while and need to be cancelled mid-call.

Ah, that makes sense. Thanks for the clarification.

…panPostProcessor.java Co-authored-by: Stuart McCulloch <[email protected]>

ValentinZakharov added the comp: asm waf Application Security Management (WAF) label Mar 12, 2024

ValentinZakharov self-assigned this Mar 12, 2024

ValentinZakharov force-pushed the vzakharov/post-processing branch from c340d45 to adf13d0 Compare March 12, 2024 00:48

ValentinZakharov force-pushed the vzakharov/post-processing branch 2 times, most recently from 90ff9a9 to 966275f Compare March 12, 2024 17:52

ValentinZakharov added the comp: core Tracer core label Mar 12, 2024

ValentinZakharov force-pushed the vzakharov/post-processing branch from 966275f to 3b692dd Compare March 13, 2024 12:28

ValentinZakharov marked this pull request as ready for review March 13, 2024 13:33

ValentinZakharov requested a review from a team as a code owner March 13, 2024 13:33

ValentinZakharov requested review from dougqh and nayeem-kamal March 13, 2024 13:34

dougqh requested changes Mar 13, 2024

View reviewed changes

ValentinZakharov requested review from mcculls and dougqh March 18, 2024 23:59

ValentinZakharov force-pushed the vzakharov/post-processing branch from e3db013 to fd8abcd Compare March 26, 2024 11:58

mcculls reviewed Mar 27, 2024

View reviewed changes

dd-trace-core/src/main/java/datadog/trace/core/postprocessor/SpanPostProcessor.java Outdated Show resolved Hide resolved

mcculls reviewed Mar 27, 2024

View reviewed changes

dd-trace-core/src/main/java/datadog/trace/core/postprocessor/SpanPostProcessor.java Show resolved Hide resolved

mcculls reviewed Mar 27, 2024

View reviewed changes

mcculls requested changes Mar 27, 2024

View reviewed changes

mcculls reviewed Mar 27, 2024

View reviewed changes

ValentinZakharov force-pushed the vzakharov/post-processing branch from af1b0cd to 361b7ba Compare March 27, 2024 15:27

mcculls approved these changes Mar 27, 2024

View reviewed changes

ValentinZakharov force-pushed the vzakharov/post-processing branch from 4f4cfe9 to 361b7ba Compare March 27, 2024 18:30

dougqh reviewed Mar 27, 2024

View reviewed changes

ValentinZakharov force-pushed the vzakharov/post-processing branch from 08626e3 to 0338f34 Compare April 12, 2024 16:06

ValentinZakharov and others added 10 commits April 15, 2024 19:24

Introduced trace post-processing

bfe73e9

setRequiresPostProcessing fully moved to DDSpanContext

da1e19b

Removed Thread creation and added timeout boolean supplier

8ac8c0e

Removed option to disable post-processing

d70d1cf

TracePostProcessor replaced with SpanPostProcessor

2dc3605

Update dd-trace-core/src/main/java/datadog/trace/core/postprocessor/S…

81bd615

…panPostProcessor.java Co-authored-by: Stuart McCulloch <[email protected]>

Minor improvements for pr-review

a7ccdef

Spotless

ed04cb2

Default trace post-processing timeout value

96342ff

Added test

749f577

ValentinZakharov force-pushed the vzakharov/post-processing branch from 0338f34 to 749f577 Compare April 15, 2024 16:24

Catch any exception, if happens during span post-processing

ae1ecea

ValentinZakharov requested a review from dougqh April 16, 2024 20:55

dougqh approved these changes Apr 16, 2024

View reviewed changes

ValentinZakharov merged commit 632ec88 into master Apr 16, 2024

ValentinZakharov deleted the vzakharov/post-processing branch April 16, 2024 21:25

github-actions bot added this to the 1.33.0 milestone Apr 16, 2024

PerfectSlayer added the tag: experimental Experimental changes label Apr 18, 2024

smola mentioned this pull request Mar 3, 2025

New API Security sampling algorithm #8178

Merged

		@@ -133,5 +133,9 @@ public final class TracerConfig {

		public static final String TRACE_FLUSH_INTERVAL = "trace.flush.interval";

		public static final String TRACE_POST_PROCESSING_ENABLED = "trace.post-processing.enabled";

	log.debug("Span post-processing is interrupted due to timeout.");
	log.debug("Span post-processing interrupted due to timeout.");

Introduced trace post-processing #6800

Introduced trace post-processing #6800

Uh oh!

Conversation

ValentinZakharov commented Mar 12, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What Does This Do

Motivation

Additional Notes

Uh oh!

pr-commenter bot commented Mar 12, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Benchmarks

Startup

Parameters

Summary

Load

Parameters

Summary

Dacapo

Uh oh!

dougqh left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mcculls Mar 13, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

mcculls Mar 27, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mcculls left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mcculls left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ValentinZakharov Mar 27, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

ValentinZakharov commented Mar 12, 2024 •

edited

Loading

pr-commenter bot commented Mar 12, 2024 •

edited

Loading

mcculls Mar 13, 2024 •

edited

Loading

mcculls Mar 27, 2024 •

edited

Loading

mcculls left a comment •

edited

Loading

ValentinZakharov Mar 27, 2024 •

edited

Loading

dougqh Mar 27, 2024 •

edited

Loading

dougqh Mar 27, 2024 •

edited

Loading