poc: original request is not aborted on bodytimeout (only retry handler?) #4470

Open
Uzlopak wants to merge 2 commits into main from deflake-3356

Conversation

@Uzlopak

@Uzlopak Uzlopak commented Aug 26, 2025

Contributor

This PR is far from perfect! It actually started as an attempt to deflake test/issue-3356.js.

Please have a look at the code and the tests. It is hard to explain:

It seems that the flakiness actually shows that we have some underlying issue. If we get a bodyTimeout, it doesn't mean that the connection is closed. It just means that the body did not finish in the expected time. OK, no problem if the connection closes after some time. But if we have a body timeout and the connection is still open and potentially still sending data, then some undefined behavior happens.

So I assume we have to abort the request before we retry with a new request. Maybe even check how much data we buffered already, and set the corresponding content-range headers.

But tbh, I am kind of lost in this part of the code. So I am showing you this; maybe you have better ideas.
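
To illustrate the idea of resuming from the bytes already received, here is a minimal sketch of how a retry could build its request header and read the server's answer. The helper names `rangeHeaderFor` and `parseContentRange` are hypothetical and not part of undici. The sketch assumes standard HTTP byte ranges: the client sends a `Range` request header, and a server that honors it answers with `Content-Range`.

```javascript
// Hypothetical helpers for illustration only; they are not undici APIs.

// Build the request header that asks the server to resume after the
// bytes we already received. With nothing received, no header is needed.
function rangeHeaderFor (bytesReceived) {
  return bytesReceived > 0 ? { range: `bytes=${bytesReceived}-` } : {}
}

// Parse a Content-Range response header such as "bytes 7-11/12".
// Returns null when the header is missing or is not a byte range.
function parseContentRange (value) {
  const match = /^bytes (\d+)-(\d+)\/(\d+|\*)$/.exec(value ?? '')
  if (match === null) return null
  return {
    start: Number(match[1]),
    end: Number(match[2]),
    size: match[3] === '*' ? null : Number(match[3])
  }
}

console.log(rangeHeaderFor(7))
console.log(parseContentRange('bytes 7-11/12'))
```

A retry would only be safe to splice onto the earlier data if the parsed `start` equals the number of bytes already received.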

This relates to...

Rationale

Changes

Features

Bug Fixes

Breaking Changes and Deprecations

Status

@Uzlopak Uzlopak requested review from mcollina and metcoder95 August 26, 2025 13:43
Comment on lines +164 to +171
// If the error is a body timeout we want to abort the request
// as the server could be still sending data and we want to avoid
// to have multiple ongoing requests.
if (code === 'UND_ERR_BODY_TIMEOUT') {
  if (controller && !controller.aborted) {
    controller.abort()
  }
  shouldRetryCb(err)

Contributor Author

this feels so hacky.

Comment on lines +194 to +196
setTimeout(() => {
  shouldRetryCb(null)
}, retryTimeout)?.unref()

Contributor Author

Why do I unref it? Tbh... where do we actually clear the timer?

Member

It is not cleared, as it is not common for a request to get aborted on retry, though having that as a safety net when controller.abort is called seems like a good approach.

}

- static [kRetryHandlerDefaultRetry] (err, { state, opts }, cb) {
+ static [kRetryHandlerDefaultRetry] (err, { controller, state, opts }, shouldRetryCb) {

Contributor Author

Renamed it to shouldRetryCb so that it is easier to grok what it does.
controller is passed so that the request can potentially be aborted.

opts: { retryOptions: this.retryOpts, ...this.opts }
},
- shouldRetry.bind(this)
+ shouldRetryCb

Contributor Author

Because shouldRetryCb is now an arrow function, this points to the retry handler anyway.

Comment thread test/issue-3356.js
setTimeout(() => { res.end('ello world!') }, 100)
if (callCount++ === 0) {
res.write('ahahaha')
// never end the response

Contributor Author

At first I thought about increasing the timeouts, but then I thought: what happens if we never time out?!

The original issue 3356 was that we should ensure that we don't concatenate the responses. The solution was that non-206 responses should throw.

Comment thread test/issue-3356.js
// never end the response
} else {
res.end('hello world!')
t.fail('should not be called twice')

Contributor Author

If you run the code on main, you will see that the route handler is called twice! This means that the retry handler makes the request twice. That doesn't seem right if we say that responses with status 200 cannot be processed like responses with content-range.

Member

A response with status 200 can mean that no more data is available (the whole request was consumed) or that the server just doesn't support range requests and will send the whole body instead.

The handler already covers that, but I will need to check what might be wrong with it.

Contributor Author

According to the issue, we had the problem that if we piped the response stream to the target stream, there is no way to "revert" the data already sent downstream. E.g. we stream to a file stream and partial data is written, the response stream has issues, and we retry, so we start streaming from the beginning again. Bam, double data.

The consensus of that issue was to handle it as an error and define the state of the request/response as non-recoverable.

Maybe my understanding is wrong. But this means that status 200 means that we don't retry. Of course we could consider that even if status 200 is thrown we retry and see if the response is a partial response with the corresponding range headers set.
But I don't see it in my tests?!

We could of course have tried other approaches too. For example: make a request and track the transferred content in bytes; on error, retry with range headers in the hope that the server will accept them; and if there are no content-range headers, dump bytes until we reach new bytes and only then push them to the real stream. Such behaviour should be configurable.
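
The "dump bytes until we reach new bytes" approach described above could look roughly like the following sketch. `createResumableSink` is a hypothetical helper, not an undici API, and it assumes that the body is byte-identical across attempts (something an etag check would have to confirm; in the modified test the two responses differ, so this would not be safe there).

```javascript
// Hypothetical sketch, not an undici API: forward only bytes that the
// downstream consumer has not seen yet, even when a retry restarts at 0.
function createResumableSink (write) {
  let delivered = 0 // bytes already pushed downstream
  let offset = 0 // position of the current attempt within the body

  return {
    // Call once per attempt. `start` is the body offset the attempt begins
    // at: 0 for a full 200 response, N for a 206 that resumes at byte N.
    beginAttempt (start = 0) {
      if (start > delivered) throw new RangeError('gap in response body')
      offset = start
    },
    // Drop the prefix that was already delivered, forward the rest.
    onData (chunk) {
      const skip = Math.min(Math.max(delivered - offset, 0), chunk.length)
      offset += chunk.length
      if (skip < chunk.length) {
        const fresh = chunk.subarray(skip)
        delivered += fresh.length
        write(fresh)
      }
    }
  }
}

// Usage: the first attempt stalls after "hello ", the retry resends everything.
const parts = []
const sink = createResumableSink((chunk) => parts.push(chunk))
sink.beginAttempt(0)
sink.onData(Buffer.from('hello '))
sink.beginAttempt(0)
for (const piece of ['hel', 'lo wor', 'ld!']) sink.onData(Buffer.from(piece))
console.log(Buffer.concat(parts).toString())
```

The cost of this design is that a server without range support makes the client download and discard the already delivered prefix again, which is one reason to keep it opt-in.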

Comment thread test/issue-3356.js
after(() => once(server.close(), 'close'))

const agent = new RetryAgent(new Agent({ bodyTimeout: 50 }), {
errorCodes: ['UND_ERR_BODY_TIMEOUT']

Contributor Author

This means that UND_ERR_BODY_TIMEOUT should retry. But actually we decided that it should not retry?

Comment thread test/issue-3356.js
await t.completed
})

test('https://github.com/nodejs/undici/issues/3356', { skip: true }, async (t) => {

Contributor Author

I skipped this test, because it is not working. Maybe the logic for 206 with content-range is wrong.

Member

Since when is it not working? Or is it only not working with the new changes?

Contributor Author

IIRC it also fails on main. But maybe the test setup is bad.

@fatal10110

Contributor

IMHO I do not think the solution you are looking for lies entirely in the retry handler. You described two issues in the description of the PR:

> But if we have a body timeout and the connection is still open and potentially still sending data, then some undefined behavior happens.

How can that happen if you destroy the socket on bodyTimeout?

util.destroy(socket, new BodyTimeoutError())

> Maybe even check how much data we buffered already, and set the corresponding content-range headers.

That's the only implementation that should be in the retry handler.
There is no reason to stop the retry process on body timeout AFAIK.

@metcoder95 metcoder95 left a comment

Member

The approach lgtm. If the range-request logic is broken with these changes, we might need to verify the changes, or verify that the retry handler properly processes range requests as per spec.

I can try to do that later this week.

Regarding the timer unref, I'd recommend not applying unref, as it possibly introduces a breaking change (a terminating process would no longer account for the request about to be retried). Though I'm +1 on clearing the timer upon the request getting aborted.
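
As a rough illustration of clearing the retry timer on abort instead of unref-ing it, consider the sketch below. `scheduleRetry` is a hypothetical helper, and a standard `AbortSignal` stands in for the retry handler's own controller, which is an assumption made only to keep the example self-contained.

```javascript
// Hypothetical helper, not part of undici: schedule a retry callback and
// cancel the pending timer if the request is aborted first.
function scheduleRetry (signal, delayMs, onRetry) {
  const state = { pending: false }
  if (signal.aborted) return state // already aborted, never schedule

  const onAbort = () => {
    clearTimeout(timer) // the timer no longer keeps the process alive
    state.pending = false
  }
  const timer = setTimeout(() => {
    signal.removeEventListener('abort', onAbort)
    state.pending = false
    onRetry()
  }, delayMs)

  state.pending = true
  signal.addEventListener('abort', onAbort, { once: true })
  return state
}

// Usage: aborting before the delay elapses cancels the retry.
const retryController = new AbortController()
const retryState = scheduleRetry(retryController.signal, 1000, () => {
  console.log('retrying')
})
retryController.abort()
console.log(retryState.pending)
```

Because the timer stays ref-ed until it fires or is cleared, a process with a retry in flight keeps running as before, while an aborted request releases the timer immediately.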

@Uzlopak

Uzlopak commented Aug 30, 2025

Contributor Author

@metcoder95

I personally lack insight into these parts, and I think it would be great if you would investigate it further. Should I close this PR?

@Uzlopak

Uzlopak commented Aug 30, 2025

Contributor Author

@fatal10110

I don't think the socket gets destroyed. I would need to investigate, though. But I guess it is because we are not working directly on the h1 client?
Idk.

Anyhow, imho undici is acting strange. This PR was just a poc. Maybe everything is fine and I am wrong...

// If the error is a body timeout we want to abort the request
// as the server could be still sending data and we want to avoid
// to have multiple ongoing requests.
if (code === 'UND_ERR_BODY_TIMEOUT') {

Member

why?

Contributor Author

Use the modified test on main. The process hangs unrecoverably.

@artur-ma

Contributor

Is there a way / test to reproduce the issue you are describing? When I run the test on the main branch, it always passes.

@Uzlopak

Uzlopak commented Aug 31, 2025

Contributor Author

@artur-ma

This is what I see when I run my modified test on main:

aras@aras-HP-ZBook-15-G3:~/workspace/undici$ node test/issue-3356.js 
✖ https://github.com/nodejs/undici/issues/3356 (1525.138235ms)
  AssertionError [ERR_ASSERTION]: should not be called twice
      at res.<computed> [as fail] (/home/aras/workspace/undici/node_modules/@matteo.collina/tspl/tspl.js:58:35)
      at Server.<anonymous> (/home/aras/workspace/undici/test/issue-3356.js:23:9)
      at Server.emit (node:events:524:28)
      at parserOnIncoming (node:_http_server:1141:12)
      at HTTPParser.parserOnHeadersComplete (node:_http_common:118:17) {
    generatedMessage: false,
    code: 'ERR_ASSERTION',
    actual: undefined,
    expected: undefined,
    operator: 'fail'
  }

﹣ https://github.com/nodejs/undici/issues/3356 (0.122966ms) # SKIP

The process hangs...

@metcoder95

Member

> I personally lack insight into these parts, and I think it would be great if you would investigate it further. Should I close this PR?

Sure, I can do that

@artur-ma

artur-ma commented Sep 4, 2025

Contributor

@Uzlopak

> @artur-ma
>
> This is what I see when I run my modified test on main:
>
> aras@aras-HP-ZBook-15-G3:~/workspace/undici$ node test/issue-3356.js
> ✖ https://github.com/nodejs/undici/issues/3356 (1525.138235ms)
>   AssertionError [ERR_ASSERTION]: should not be called twice
>       at res.<computed> [as fail] (/home/aras/workspace/undici/node_modules/@matteo.collina/tspl/tspl.js:58:35)
>       at Server.<anonymous> (/home/aras/workspace/undici/test/issue-3356.js:23:9)
>       at Server.emit (node:events:524:28)
>       at parserOnIncoming (node:_http_server:1141:12)
>       at HTTPParser.parserOnHeadersComplete (node:_http_common:118:17) {
>     generatedMessage: false,
>     code: 'ERR_ASSERTION',
>     actual: undefined,
>     expected: undefined,
>     operator: 'fail'
>   }
>
> ﹣ https://github.com/nodejs/undici/issues/3356 (0.122966ms) # SKIP
>
> The process hangs...

That sounds like an incorrect test. It's expected to be called twice, since this is the purpose of retry on timeout.
From what I understand, the case you are trying to fix is a different one: both sockets are active (data is being written to them) at the same time because the first socket wasn't destroyed.

@Uzlopak

Uzlopak commented Sep 6, 2025

Contributor Author

@artur-ma

Did you read what I wrote? Did you read the corresponding issue?

Exactly the opposite of what you wrote is the expected behavior.

@artur-ma

artur-ma commented Sep 7, 2025

Contributor

@Uzlopak

> @artur-ma
>
> Did you read what I wrote? Did you read the corresponding issue?
>
> Exactly the opposite of what you wrote is the expected behavior.

I read what you wrote, and this is exactly what I'm saying.

> But if we have a body timeout and the connection is still open and potentially still sending data, then some undefined behavior happens.

Retry is expected on timeout; the second call to the API is expected, as this is the purpose of the retry mechanism. What is not expected is that the old socket is still active after the timeout.

So how is it the opposite?

@Uzlopak

Uzlopak commented Sep 7, 2025

Contributor Author

@artur-ma

If the server does not support ranges, no range headers are sent, there is no 206 status code and maybe no etag to verify, then we should not retry if a body was already sent. The test simulates a case which does not meet the conditions for a retry, but the retry handler does the retry anyway.
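
The conditions listed above can be written down as a small decision function. This is only a sketch under stated assumptions: `checkRetriedResponse` is a hypothetical name, header keys are assumed to be lower-cased, and it is not a description of what undici's retry handler currently does.

```javascript
// Hypothetical decision helper for illustration; not undici's implementation.
// `state` describes what was received before the failure, `response` is the
// response to the retried request (header names assumed lower-cased).
function checkRetriedResponse (state, response) {
  // Nothing was passed downstream yet, so starting over is harmless.
  if (state.bytesReceived === 0) return { ok: true }

  // A 200 means the server ignored the range and resends the whole body.
  if (response.statusCode !== 206) {
    return { ok: false, reason: 'expected 206 after a partial body' }
  }

  // The partial response must continue exactly where we stopped.
  const match = /^bytes (\d+)-/.exec(response.headers['content-range'] ?? '')
  if (match === null || Number(match[1]) !== state.bytesReceived) {
    return { ok: false, reason: 'content-range does not match received bytes' }
  }

  // If we saw an etag before, the resource must still be the same one.
  if (state.etag != null && response.headers.etag !== state.etag) {
    return { ok: false, reason: 'etag changed between attempts' }
  }

  return { ok: true }
}

// The modified test's situation: 7 bytes delivered, retry answered with 200.
console.log(checkRetriedResponse({ bytesReceived: 7 }, { statusCode: 200, headers: {} }))
```

Under these rules the second request in the modified test would still be sent, but its 200 response would be rejected rather than appended to the partial body.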

@artur-ma

artur-ma commented Sep 9, 2025

Contributor

@Uzlopak

> @artur-ma
>
> If the server does not support ranges, no range headers are sent, there is no 206 status code and maybe no etag to verify, then we should not retry if a body was already sent. The test simulates a case which does not meet the conditions for a retry, but the retry handler does the retry anyway.

Thank you for the clarification, but it still sounds like the test is wrong. The logic you described is not handled right now in main, so after adding errorCodes: ['UND_ERR_BODY_TIMEOUT'] as a RetryAgent option, the request is expected to be retried, as you are explicitly setting it as a retryable error code.

If it's not set, the error will be thrown and bubble up to this line

this.aborted = true

which AFAIK basically does the same thing as the controller.abort() that you used in this PR.
The only difference here is that someone can catch the error down the line.

The logic you described is handled in the onResponseStart method. This means that if the request was retried, then while consuming the data the second time during the retry, we indicate that this error shouldn't have been retried in the first place (this flow should not happen, as a body timeout throws by default and no retry happens).

Please correct me if I got you wrong again.

@mcollina

mcollina commented Jan 3, 2026

Member

@Uzlopak @artur-ma any updates on this?


5 participants