poc: original request is not aborted on bodytimeout (only retry handler?)#4470
poc: original request is not aborted on bodytimeout (only retry handler?)#4470Uzlopak wants to merge 2 commits into
Conversation
| // If the error is a body timeout we want to abort the request | ||
| // as the server could be still sending data and we want to avoid | ||
| // to have multiple ongoing requests. | ||
| if (code === 'UND_ERR_BODY_TIMEOUT') { | ||
| if (controller && !controller.aborted) { | ||
| controller.abort() | ||
| } | ||
| shouldRetryCb(err) |
There was a problem hiding this comment.
this feels so hacky.
| setTimeout(() => { | ||
| shouldRetryCb(null) | ||
| }, retryTimeout)?.unref() |
There was a problem hiding this comment.
Why I unref it? tbh... where do we actually clear the timer?
There was a problem hiding this comment.
Is not cleared as is not common request gets aborted on retry, tho having it as safe net when the controller.abort is called seems a good approach
| } | ||
|
|
||
| static [kRetryHandlerDefaultRetry] (err, { state, opts }, cb) { | ||
| static [kRetryHandlerDefaultRetry] (err, { controller, state, opts }, shouldRetryCb) { |
There was a problem hiding this comment.
renamed it to shouldRetryCb so that it is easier to grok what it does.
controller is passed to potentially abort the request.
| opts: { retryOptions: this.retryOpts, ...this.opts } | ||
| }, | ||
| shouldRetry.bind(this) | ||
| shouldRetryCb |
There was a problem hiding this comment.
because shouldRetryCb is now an arrow function, this points to the retry handler anyway.
| setTimeout(() => { res.end('ello world!') }, 100) | ||
| if (callCount++ === 0) { | ||
| res.write('ahahaha') | ||
| // never end the response |
There was a problem hiding this comment.
First i thought, to increase the timeouts, but then i thought: What happens if we never timeout?!
Original issue 3356 was, that we should ensure, that we dont concat the responses. solution was that non-206 responses should throw.
| // never end the response | ||
| } else { | ||
| res.end('hello world!') | ||
| t.fail('should not be called twice') |
There was a problem hiding this comment.
If you run the code on main, you will see that we will call the routehandler twice! This means, that the retry handler makes the request twice. That doesnt seem right if we say, that responses with status 200 will not be able to process responses with content-range
There was a problem hiding this comment.
Responses with 200 can mean that no more data is available (consumed all request) or the server just don't support range-request and will send the whole body instead.
The handler already covered that, but will need to check what was possibly wrong with it
There was a problem hiding this comment.
According to the issue, we had the problem that if we passed the response stream to the target stream, there is no way to "revert" that downstreamed data. E.g. we stream to a file stream, and partial data is written, response stream has issues, now retry, so we begin from the beginning to stream. Bam, double data.
The consensus of that issue was to handle it as an error and define the state of the request/response as non recoverable
Maybe my understanding is wrong. But this means that status 200 means that we dont retry. Of course we could consider that even if status 200 is thrown we retry and see if the response is a partial response with corresponding range headers set.
But i dont see it in my tests?!
We could have of course tried other approaches too. Like make a request and track transferred content on bytes, on error do retry sent range headers in hope it will accept it, and if there are no content-range headers dump bytes till we get new bytes and push them finally to the real stream. Such a behaviour should be configurable.
| after(() => once(server.close(), 'close')) | ||
|
|
||
| const agent = new RetryAgent(new Agent({ bodyTimeout: 50 }), { | ||
| errorCodes: ['UND_ERR_BODY_TIMEOUT'] |
There was a problem hiding this comment.
This means that UND_ERR_BODY_TIMEOUT should retry. But actually we decided, that it shuold not retry?
| await t.completed | ||
| }) | ||
|
|
||
| test('https://github.com/nodejs/undici/issues/3356', { skip: true }, async (t) => { |
There was a problem hiding this comment.
I skipped this test, because it is not working. Maybe the logic for 206 with content-range is wrong.
There was a problem hiding this comment.
Since when is not working, or only not working with the new changes?
There was a problem hiding this comment.
IIRC it also fails on main. But maybe the test setup is bad.
|
IMHO I do not think the solution u are looking for is entirely in retry handler, you have two issues described in the description of the PR you described two issues here
How can it happen, if you destroy the socket on bodyTimeout?
Thats the only implementation that should be in retry handler |
metcoder95
left a comment
There was a problem hiding this comment.
The approach lgtm, if the range-request logic is broken with this changes, we might need to verify the changes or that the retry handler properly processes the range-requests as per spec.
I can try to do that later this week.
About the timer unref, I'd recommend not apply unref as possibly imposes a breaking change (now terminating process won't account for the request about to be retried). Tho, I'm +1 on cleaning the timer and upon request getting aborted.
| await t.completed | ||
| }) | ||
|
|
||
| test('https://github.com/nodejs/undici/issues/3356', { skip: true }, async (t) => { |
There was a problem hiding this comment.
Since when is not working, or only not working with the new changes?
| // never end the response | ||
| } else { | ||
| res.end('hello world!') | ||
| t.fail('should not be called twice') |
There was a problem hiding this comment.
Responses with 200 can mean that no more data is available (consumed all request) or the server just don't support range-request and will send the whole body instead.
The handler already covered that, but will need to check what was possibly wrong with it
| setTimeout(() => { | ||
| shouldRetryCb(null) | ||
| }, retryTimeout)?.unref() |
There was a problem hiding this comment.
Is not cleared as is not common request gets aborted on retry, tho having it as safe net when the controller.abort is called seems a good approach
|
I personally lack the insights in these parts and I think it would great if you would investigate it further. Should i close this PR? |
|
I dont think the socket gets destroyed. Would need to investigate though. But i guess it is because we are not directly working on the h1 client? Anyhow imho undici is acting strange. This PR was just a poc. Maybe everything is fine and i am wrong... |
| // If the error is a body timeout we want to abort the request | ||
| // as the server could be still sending data and we want to avoid | ||
| // to have multiple ongoing requests. | ||
| if (code === 'UND_ERR_BODY_TIMEOUT') { |
There was a problem hiding this comment.
Use the modified test on main. The process hangs unrecoverable.
|
Is there a way / test to repoduce the issue u are describing? Running the test on main brach, it always passes |
|
This is what i see, when i run my modified test on main: aras@aras-HP-ZBook-15-G3:~/workspace/undici$ node test/issue-3356.js
✖ https://github.com/nodejs/undici/issues/3356 (1525.138235ms)
AssertionError [ERR_ASSERTION]: should not be called twice
at res.<computed> [as fail] (/home/aras/workspace/undici/node_modules/@matteo.collina/tspl/tspl.js:58:35)
at Server.<anonymous> (/home/aras/workspace/undici/test/issue-3356.js:23:9)
at Server.emit (node:events:524:28)
at parserOnIncoming (node:_http_server:1141:12)
at HTTPParser.parserOnHeadersComplete (node:_http_common:118:17) {
generatedMessage: false,
code: 'ERR_ASSERTION',
actual: undefined,
expected: undefined,
operator: 'fail'
}
﹣ https://github.com/nodejs/undici/issues/3356 (0.122966ms) # SKIP
The process hangs... |
Sure, I can do that |
That sounds like incorrect test.. Its expected to be called twice, since this is the purpose of retry on timeout |
|
Did you read what I wrote? Did you read the corresponding issue? Exactly the opposite of what you wrote is the expected behavior. |
I read what u wrote, and this is exactly what Im saying
Retry is expected on timeout, the second call to the API is expected this is the purpose of retry mechanism, what is not expected, is that the old socket still be active after timeout So how is it the opossite? |
|
If the server does not support ranges, no range headers are sent, no 206 status code and maybe no etag to verify, then we should not retry if a body was already sent. The test does simulates a case which does not meet the conditions for a retry, but the retry handler does the retry anway. |
Thank you for clarification, and still it sounds like the test is wrong, the logic u described is not handled right now in If its not set, the error will be thrown and bubble up to this line Line 314 in f182ff1 which AFAIK basically does the same thing as The logic you described is handled in a method please correct me if I get you wrong again. |
This PR is far from perfect! Actually started as an approach to deflake test/isue-3356.js.
Please have a look at the code and the tests. It is hard to explain:
It seems that the flakyness actually shows that we have some underlying issue. If we get a bodyTimeout, it doesnt mean that the connection is closed. It just means that the body did not finish in the expected time. Ok, no problem is the connection closes after some time. But if we have a body timeout and the connection is still open and potentially still sending data, then some undefined behavior happens.
So I assume, we have to abort the request before we retry with a new request. Maybe even check how much data we buffered already, and set the corresponding content-range headers.
But tbh. I am kind of lost in this part of the code. So I show you this, maybe you have better ideas.
This relates to...
Rationale
Changes
Features
Bug Fixes
Breaking Changes and Deprecations
Status