Skip to content

Enforce parent-child relationship for ES/HTTP span #1963

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

dgieselaar
Copy link
Member

@dgieselaar dgieselaar commented Feb 3, 2021

All instrumented calls from the Elasticsearch client create two spans: first, an Elasticsearch span, and then, an http span. Currently, the http span is a sibling of the Elasticsearch, rather than its child, because the http span is created in the same async operation. This leads to breakdown timings being counted twice and makes it hard to see parent/child relationships in the trace waterfall.

As addressing this in a general manner has bigger consequences, I've tried to address this specifically for the ES client's instrumentation where the scope is relatively small and known.

I'm also wondering if we can use this strategy in our custom instrumentation in Kibana (instead of awaiting process.nextTick or Promise.resolve.

@github-actions github-actions bot added the agent-nodejs Make available for APM Agents project planning. label Feb 3, 2021
@trentm
Copy link
Member

trentm commented Feb 3, 2021

Interesting.

Compare to https://github.com/elastic/apm-agent-nodejs/pull/1890/files FWIW, which is my proposed patch for solving this general issue for all cases -- albeit with the requirement to grasp possible impact of that wider behaviour change.

@apmmachine
Copy link
Contributor

apmmachine commented Feb 3, 2021

💔 Tests Failed

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview

Expand to view the summary

Build stats

  • Build Cause: Branch indexing

  • Start Time: 2021-07-06T00:59:22.472+0000

  • Duration: 12 min 47 sec

  • Commit: 1ed3e8b

Test stats 🧪

Test Results
Failed 18
Passed 0
Skipped 0
Total 18

Trends 🧪

Image of Build Times

Image of Tests

Test errors 18

Expand to view the tests failures

> Show only the first 10 test failures

Test / Node.js-14.0 / [empty] – *-output-junit.xml
  • no error details
  • Expand to view the stacktrace

     Test report file /var/lib/jenkins/workspace/ejs_apm-agent-nodejs-mbp_PR-1963/src/github.com/elastic/apm-agent-nodejs/*-output-junit.xml was length 0 
    

Test / Node.js-14 / [empty] – *-output-junit.xml
  • no error details
  • Expand to view the stacktrace

     Test report file /var/lib/jenkins/workspace/ejs_apm-agent-nodejs-mbp_PR-1963/src/github.com/elastic/apm-agent-nodejs/*-output-junit.xml was length 0 
    

Test / Node.js-14-async-hooks-false / [empty] – *-output-junit.xml
  • no error details
  • Expand to view the stacktrace

     Test report file /var/lib/jenkins/workspace/ejs_apm-agent-nodejs-mbp_PR-1963/src/github.com/elastic/apm-agent-nodejs/*-output-junit.xml was length 0 
    

Test / Node.js-13 / [empty] – *-output-junit.xml
  • no error details
  • Expand to view the stacktrace

     Test report file /var/lib/jenkins/workspace/ejs_apm-agent-nodejs-mbp_PR-1963/src/github.com/elastic/apm-agent-nodejs/*-output-junit.xml was length 0 
    

Test / Node.js-10-async-hooks-false / [empty] – *-output-junit.xml
  • no error details
  • Expand to view the stacktrace

     Test report file /var/lib/jenkins/workspace/ejs_apm-agent-nodejs-mbp_PR-1963/src/github.com/elastic/apm-agent-nodejs/*-output-junit.xml was length 0 
    

Test / Node.js-10.0-async-hooks-false / [empty] – *-output-junit.xml
  • no error details
  • Expand to view the stacktrace

     Test report file /var/lib/jenkins/workspace/ejs_apm-agent-nodejs-mbp_PR-1963/src/github.com/elastic/apm-agent-nodejs/*-output-junit.xml was length 0 
    

Test / Node.js-14.0-async-hooks-false / [empty] – *-output-junit.xml
  • no error details
  • Expand to view the stacktrace

     Test report file /var/lib/jenkins/workspace/ejs_apm-agent-nodejs-mbp_PR-1963/src/github.com/elastic/apm-agent-nodejs/*-output-junit.xml was length 0 
    

Test / Node.js-12.0 / [empty] – *-output-junit.xml
  • no error details
  • Expand to view the stacktrace

     Test report file /var/lib/jenkins/workspace/ejs_apm-agent-nodejs-mbp_PR-1963/src/github.com/elastic/apm-agent-nodejs/*-output-junit.xml was length 0 
    

Test / Node.js-12-async-hooks-false / [empty] – *-output-junit.xml
  • no error details
  • Expand to view the stacktrace

     Test report file /var/lib/jenkins/workspace/ejs_apm-agent-nodejs-mbp_PR-1963/src/github.com/elastic/apm-agent-nodejs/*-output-junit.xml was length 0 
    

Test / Node.js-10 / [empty] – *-output-junit.xml
  • no error details
  • Expand to view the stacktrace

     Test report file /var/lib/jenkins/workspace/ejs_apm-agent-nodejs-mbp_PR-1963/src/github.com/elastic/apm-agent-nodejs/*-output-junit.xml was length 0 
    

Steps errors 38

Expand to view the steps failures

Show only the first 10 steps failures

Run Tests
  • Took 0 min 3 sec . View more details on here
  • Description: .ci/scripts/test.sh -b "release" -t "" "8"
Run Tests
  • Took 0 min 1 sec . View more details on here
  • Description: .ci/scripts/test.sh -b "release" -t "" "8"
Run Tests
  • Took 0 min 2 sec . View more details on here
  • Description: .ci/scripts/test.sh -b "release" -t "" "8"
Run Tests
  • Took 0 min 1 sec . View more details on here
  • Description: .ci/scripts/test.sh -b "release" -t "" "8"
Run Tests
  • Took 0 min 3 sec . View more details on here
  • Description: .ci/scripts/test.sh -b "release" -t "" "8.6"
Run Tests
  • Took 0 min 1 sec . View more details on here
  • Description: .ci/scripts/test.sh -b "release" -t "" "8.6"
Run Tests
  • Took 0 min 2 sec . View more details on here
  • Description: .ci/scripts/test.sh -b "release" -t "" "8.6"
Run Tests
  • Took 0 min 1 sec . View more details on here
  • Description: .ci/scripts/test.sh -b "release" -t "" "8.6"
Prepare services
  • Took 0 min 2 sec . View more details on here
  • Description: .ci/scripts/windows/prepare-test.sh 14
Error signal
  • Took 0 min 0 sec . View more details on here
  • Description: hudson.AbortException: script returned exit code 1

Log output

Expand to view the last 100 lines of log output

[2021-07-06T01:10:28.103Z] 
[2021-07-06T01:10:28.103Z] found 13 vulnerabilities (3 low, 7 high, 3 critical)
[2021-07-06T01:10:28.103Z]   run `npm audit fix` to fix them, or `npm audit` for details
[2021-07-06T01:10:28.103Z] + node --version
[2021-07-06T01:10:28.103Z] v12.22.1
[2021-07-06T01:10:28.103Z] + npm --version
[2021-07-06T01:10:28.103Z] 6.14.12
[2021-07-06T01:10:28.103Z] + npm run lint
[2021-07-06T01:10:28.362Z] 
[2021-07-06T01:10:28.362Z] > [email protected] lint /app
[2021-07-06T01:10:28.362Z] > standard
[2021-07-06T01:10:28.362Z] 
[2021-07-06T01:10:38.335Z] + npm run test:deps
[2021-07-06T01:10:38.335Z] 
[2021-07-06T01:10:38.335Z] > [email protected] test:deps /app
[2021-07-06T01:10:38.335Z] > dependency-check *.js 'lib/**/*.js' 'test/**/*.js' --no-dev -i async_hooks -i perf_hooks -i parseurl
[2021-07-06T01:10:38.335Z] 
[2021-07-06T01:10:40.534Z] + cd /app
[2021-07-06T01:10:40.534Z] + .ci/scripts/test_types_babel_esm.sh
[2021-07-06T01:10:40.534Z] + npm config list
[2021-07-06T01:10:40.792Z] ; cli configs
[2021-07-06T01:10:40.792Z] metrics-registry = "https://registry.npmjs.org/"
[2021-07-06T01:10:40.792Z] scope = ""
[2021-07-06T01:10:40.792Z] user-agent = "npm/6.14.12 node/v12.22.1 linux x64 ci/jenkins"
[2021-07-06T01:10:40.792Z] 
[2021-07-06T01:10:40.792Z] ; userconfig /app/.npmrc
[2021-07-06T01:10:40.792Z] package-lock = false
[2021-07-06T01:10:40.792Z] 
[2021-07-06T01:10:40.792Z] ; node bin location = /usr/local/bin/node
[2021-07-06T01:10:40.792Z] ; cwd = /app
[2021-07-06T01:10:40.792Z] ; HOME = /app
[2021-07-06T01:10:40.792Z] ; "npm config ls -l" to show all defaults.
[2021-07-06T01:10:40.792Z] 
[2021-07-06T01:10:40.792Z] + npm install
[2021-07-06T01:10:48.907Z] npm WARN optional SKIPPING OPTIONAL DEPENDENCY: fsevents@~2.3.2 (node_modules/chokidar/node_modules/fsevents):
[2021-07-06T01:10:48.907Z] npm WARN notsup SKIPPING OPTIONAL DEPENDENCY: Unsupported platform for [email protected]: wanted {"os":"darwin","arch":"any"} (current: {"os":"linux","arch":"x64"})
[2021-07-06T01:10:48.907Z] npm WARN [email protected] requires a peer of graphql@^14.4.1 but none is installed. You must install peer dependencies yourself.
[2021-07-06T01:10:48.907Z] 
[2021-07-06T01:10:48.907Z] audited 1563 packages in 7.67s
[2021-07-06T01:10:49.842Z] 
[2021-07-06T01:10:49.842Z] 102 packages are looking for funding
[2021-07-06T01:10:49.842Z]   run `npm fund` for details
[2021-07-06T01:10:49.842Z] 
[2021-07-06T01:10:49.842Z] found 13 vulnerabilities (3 low, 7 high, 3 critical)
[2021-07-06T01:10:49.842Z]   run `npm audit fix` to fix them, or `npm audit` for details
[2021-07-06T01:10:49.842Z] + node --version
[2021-07-06T01:10:49.842Z] v12.22.1
[2021-07-06T01:10:49.842Z] + npm --version
[2021-07-06T01:10:50.099Z] 6.14.12
[2021-07-06T01:10:50.100Z] ++ node --version
[2021-07-06T01:10:50.100Z] ++ cut -d . -f1
[2021-07-06T01:10:50.100Z] ++ cut -d v -f2
[2021-07-06T01:10:50.100Z] + major_node_version=12
[2021-07-06T01:10:50.100Z] ++ node --version
[2021-07-06T01:10:50.100Z] ++ cut -d . -f2
[2021-07-06T01:10:50.100Z] + minor_node_version=22
[2021-07-06T01:10:50.100Z] + npm run test:types
[2021-07-06T01:10:50.358Z] 
[2021-07-06T01:10:50.358Z] > [email protected] test:types /app
[2021-07-06T01:10:50.358Z] > tsc --project test/types/tsconfig.json && tsc --project test/types/transpile/tsconfig.json && node test/types/transpile/index.js
[2021-07-06T01:10:50.358Z] 
[2021-07-06T01:11:02.568Z] + [[ 12 -ne 13 ]]
[2021-07-06T01:11:02.568Z] + npm run test:babel
[2021-07-06T01:11:02.568Z] 
[2021-07-06T01:11:02.568Z] > [email protected] test:babel /app
[2021-07-06T01:11:02.568Z] > babel test/babel/src.js --out-file test/babel/out.js && node test/babel/out.js
[2021-07-06T01:11:02.568Z] 
[2021-07-06T01:11:02.827Z] + npm run test:esm
[2021-07-06T01:11:03.084Z] 
[2021-07-06T01:11:03.085Z] > [email protected] test:esm /app
[2021-07-06T01:11:03.085Z] > node --experimental-modules test/esm/index.mjs
[2021-07-06T01:11:03.085Z] 
[2021-07-06T01:11:03.342Z] APM Server transport error (ECONNREFUSED): connect ECONNREFUSED 127.0.0.1:8200
[2021-07-06T01:11:03.628Z] $ docker stop --time=1 6c809874e0242176dd0985778bc89f0b3539acbf18f66097eed49828118749d8
[2021-07-06T01:11:04.985Z] $ docker rm -f 6c809874e0242176dd0985778bc89f0b3539acbf18f66097eed49828118749d8
[2021-07-06T01:11:05.092Z] [INFO] getVaultSecret: Getting secrets
[2021-07-06T01:11:05.138Z] Masking supported pattern matches of $VAULT_ADDR or $VAULT_ROLE_ID or $VAULT_SECRET_ID
[2021-07-06T01:11:06.574Z] [INFO] getVaultSecret: Getting secrets
[2021-07-06T01:11:06.621Z] Masking supported pattern matches of $VAULT_ADDR or $VAULT_ROLE_ID or $VAULT_SECRET_ID
[2021-07-06T01:11:07.819Z] Stage "TAV Test" skipped due to earlier failure(s)
[2021-07-06T01:11:07.861Z] Stage "Edge Test" skipped due to earlier failure(s)
[2021-07-06T01:11:07.919Z] Stage "Nightly Test" skipped due to earlier failure(s)
[2021-07-06T01:11:07.927Z] Stage "Nightly Test - No async hooks" skipped due to earlier failure(s)
[2021-07-06T01:11:07.933Z] Stage "RC Test" skipped due to earlier failure(s)
[2021-07-06T01:11:07.941Z] Stage "RC Test - No async hooks" skipped due to earlier failure(s)
[2021-07-06T01:11:08.028Z] Failed in branch Nightly Test
[2021-07-06T01:11:08.034Z] Failed in branch Nightly Test - No async hooks
[2021-07-06T01:11:08.038Z] Failed in branch RC Test
[2021-07-06T01:11:08.045Z] Failed in branch RC Test - No async hooks
[2021-07-06T01:11:08.111Z] Stage "Integration Tests" skipped due to earlier failure(s)
[2021-07-06T01:11:08.152Z] Stage "Release" skipped due to earlier failure(s)
[2021-07-06T01:11:08.167Z] Stage "Release" skipped due to earlier failure(s)
[2021-07-06T01:11:08.241Z] Stage "Benchmarks" skipped due to earlier failure(s)
[2021-07-06T01:11:08.493Z] Running on worker-854309 in /var/lib/jenkins/workspace/ejs_apm-agent-nodejs-mbp_PR-1963
[2021-07-06T01:11:08.543Z] [INFO] getVaultSecret: Getting secrets
[2021-07-06T01:11:08.587Z] Masking supported pattern matches of $VAULT_ADDR or $VAULT_ROLE_ID or $VAULT_SECRET_ID
[2021-07-06T01:11:10.953Z] + chmod 755 generate-build-data.sh
[2021-07-06T01:11:10.953Z] + ./generate-build-data.sh https://apm-ci.elastic.co/blue/rest/organizations/jenkins/pipelines/apm-agent-nodejs/apm-agent-nodejs-mbp/PR-1963/ https://apm-ci.elastic.co/blue/rest/organizations/jenkins/pipelines/apm-agent-nodejs/apm-agent-nodejs-mbp/PR-1963/runs/58 FAILURE 707083
[2021-07-06T01:11:10.953Z] INFO: curl https://apm-ci.elastic.co/blue/rest/organizations/jenkins/pipelines/apm-agent-nodejs/apm-agent-nodejs-mbp/PR-1963/runs/58/steps/?limit=10000 -o steps-info.json
[2021-07-06T01:11:35.647Z] INFO: curl https://apm-ci.elastic.co/blue/rest/organizations/jenkins/pipelines/apm-agent-nodejs/apm-agent-nodejs-mbp/PR-1963/runs/58/tests/?status=FAILED -o tests-errors.json

@dgieselaar dgieselaar marked this pull request as ready for review February 3, 2021 20:58
@dgieselaar
Copy link
Member Author

@trentm I've moved this out of draft so we can make a decision at some point. Feel free to close it if you feel it's not the right approach.

@trentm trentm self-requested a review February 3, 2021 21:14
@trentm trentm self-assigned this Feb 3, 2021
@trentm
Copy link
Member

trentm commented Feb 3, 2021

jenkins run the tests

Trying the jenkins run again to rule out a transient.

@dgieselaar
Copy link
Member Author

jenkins run the tests please

@trentm
Copy link
Member

trentm commented Feb 4, 2021

I'm also wondering if we can use this strategy in our custom instrumentation in Kibana (instead of awaiting process.nextTick or Promise.resolve.

Ah, do you hit the same issue with custom spans? Something like:

// ...
const span = apm.createSpan("doingThatThing")
await esClient.makeACall(...)
// ...
span.end()

Obviously that underscore in agent._instrumentation is a big hint that we'd rather not have usage like that outside the agent itself.

farside-unsafe-chair

@dgieselaar
Copy link
Member Author

@trentm Hmm. Let me rephrase: if we have to pick between await Promise.resolve() and agent._instrumentation.activeSpan = span, what's the least worst option?

@trentm
Copy link
Member

trentm commented Feb 5, 2021

we'd rather not have usage like that outside the agent itself.

I don't know what I was thinking when I said that. This is a patch to the agent itself.

if we have to pick between await Promise.resolve() and agent._instrumentation.activeSpan = span, what's the least worst option?

Promise.resolve - you can do without the agent needing to take this change or to wait for #1890 to do similar. I don't have any idea the potential impact on Kibana perf with it doing that in (all?) its calls to ES.

I do want to get to these changes, but I probably won't get there quickly.

What if the agent had an option to elide the child HTTP span for ES queries altogether? (I'm assuming we can even do this relatively easily in the node agent.) Would that even be preferable for your case? It would effectively workaround this issue until parent-child spans in general are tackled.

@dgieselaar
Copy link
Member Author

Promise.resolve - you can do without the agent needing to take this change or to wait for #1890 to do similar. I don't have any idea the potential impact on Kibana perf with it doing that in (all?) its calls to ES.

I'm currently using the runInAsyncScope workaround: https://github.com/elastic/kibana/pull/90403/files#diff-3aa1ae78694030a50794cbcc433f97b37cfd3084afaa451871d06e220c7366bfR42

awaiting Promise.resolve() works as well, but things get executed in a slightly different order then with runInAsyncScope (AFAICT), so I figured this is more predictable (I couldn't really see a difference in performance on a single request).

What if the agent had an option to elide the child HTTP span for ES queries altogether? (I'm assuming we can even do this relatively easily in the node agent.) Would that even be preferable for your case? It would effectively workaround this issue until parent-child spans in general are tackled.

I'm not sure what the consequences would be there. I do like the idea of less spans, but the spans might have different semantics which might be useful for a specific scenario (e.g., the HTTP span might record the network error, something like that?) I'm okay with postponing this for sure.

What also might be an option is to have the agent add something like a startSpanWithContext(spanOptions, runFn), where the active span for that context is the one that just started, and all spans created in that context are the direct child of that span. And the trigger context stays "unpolluted". Basically adding a more explicit API so there's no breaking change or unexpected consequences.

@AlexanderWert AlexanderWert added this to the 7.14 milestone Apr 26, 2021
@AlexanderWert AlexanderWert modified the milestones: 7.14, 7.15 Jun 1, 2021
@ddolcimascolo
Copy link
Contributor

Hey guys, any progress on that one ? I'm started to get tired of all these double HTTP/Elasticsearch spans 😊

Thx,
David

@dgieselaar
Copy link
Member Author

@ddolcimascolo hello David, that's being tracked in this issue I think: #2000

@trentm
Copy link
Member

trentm commented Jul 28, 2021

@ddolcimascolo I'm currently looking into #2000 (and effectively this ticket as well). Nothing to report yet.

@ddolcimascolo
Copy link
Contributor

Thx guys

@trentm
Copy link
Member

trentm commented Oct 27, 2021

I believe this is fixed by #2181

@trentm trentm closed this Oct 27, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
agent-nodejs Make available for APM Agents project planning.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants