client: fix potential panic during RPC retries#5323
Conversation
stream.go
Outdated
    }

    op := func(a *csAttempt) error { return a.newStream() }
    // Because this operation is always called either here (while creating the
Comment what this op does? That it creates the transport and stream, and only replaces the attempt if both are successful.
stream.go
Outdated
    op := func(a *csAttempt) error { return a.newStream() }
    // Because this operation is always called either here (while creating the
    // clientStream) or by the retry code while locked when replaying the
    // operation, it is safe to access cs.attempt directly.
Move this comment right above cs.attempt = a?
    	return err
    }
    -	if lastErr = cs.replayBufferLocked(); lastErr == nil {
    +	if lastErr = cs.replayBufferLocked(attempt); lastErr == nil {
Comment that the first op will set the attempt if it gets a transport and a stream.
Before this patch, our client-side tracing interceptor for streaming RPC calls was exposed to gRPC bugs currently being fixed in github.com/grpc/grpc-go/pull/5323. This had to do with calls to clientStream.Context() panicking with an NPE under certain races with RPCs failing. We've recently gotten two crashes seemingly because of this. It's unclear why this hasn't shown up before, as nothing seems new (either on our side or on the grpc side). In 22.2 we do use more streaming RPCs than before (for example for span configs), so maybe that does it.

This patch eliminates the problem by eliminating the problematic call into ClientStream.Context(). The background is that our interceptor needs to watch for ctx cancelation and consider the RPC done at that point. But there was no reason for that call; we can more simply use the RPC caller's ctx for the purposes of figuring out whether the caller cancels the RPC. In fact, calling ClientStream.Context() is bad for other reasons, beyond exposing us to the bug:

1) ClientStream.Context() pins the RPC attempt to a lower-level connection, and inhibits gRPC's ability to sometimes transparently retry failed calls. In fact, there's a comment on ClientStream.Context() that tells you not to call it before using the stream through other methods like Recv(), which implies that the RPC is already "pinned" and transparent retries are no longer possible anyway. We were breaking this.

2) One of the grpc-go maintainers suggested that, due to the bugs referenced above, this call could actually sometimes give us "the wrong context", although how wrong exactly is not clear to me (i.e. could we have gotten a ctx that doesn't inherit from the caller's ctx? Or a ctx that's canceled independently from the caller?).

This patch also removes a paranoid catch-all finalizer in the interceptor that assured that the RPC client span is always eventually closed (at a random GC time), regardless of what the caller does - i.e. even if the caller forgets about interacting with the response stream or canceling the ctx. This kind of paranoia is not needed. The code in question was copied from grpc-opentracing[1], which quoted a StackOverflow answer[2] about whether or not a client is allowed to simply walk away from a streaming call. As a result of conversations triggered by this patch[3], that SO answer was updated to reflect the fact that it is not, in fact, OK for a client to do so, as it will leak gRPC resources. The client's contract is specified in [4] (although arguably that place is not the easiest to find for a casual gRPC user).

In any case, this patch gets rid of the finalizer. This could in theory result in leaked spans if our own code is buggy in the way that the paranoia prevented, but all our TestServers check that spans don't leak like that, so we are pretty well protected. FWIW, a newer re-implementation of the OpenTracing interceptor[5] doesn't have the finalizer (although it also doesn't listen for ctx cancellation, so I think it's buggy), and neither does the equivalent OpenTelemetry interceptor[6].

Fixes cockroachdb#80689

[1] https://github.com/grpc-ecosystem/grpc-opentracing/blob/8e809c8a86450a29b90dcc9efbf062d0fe6d9746/go/otgrpc/client.go#L174
[2] https://stackoverflow.com/questions/42915337/are-you-required-to-call-recv-until-you-get-io-eof-when-interacting-with-grpc-cl
[3] grpc/grpc-go#5324
[4] https://pkg.go.dev/google.golang.org/grpc#ClientConn.NewStream
[5] https://github.com/grpc-ecosystem/go-grpc-middleware/blob/master/tracing/opentracing/client_interceptors.go#L37-L52
[6] https://github.com/open-telemetry/opentelemetry-go-contrib/blame/main/instrumentation/google.golang.org/grpc/otelgrpc/interceptor.go#L193

Release note: A rare crash indicating a nil-pointer dereference in google.golang.org/grpc/internal/transport.(*Stream).Context(...) was fixed.
dfawley
left a comment
Thanks for the review. All comments addressed.
Just curious if this could get backported to 1.46.x or is the change too large.
Yes, I believe this should be backported.
Quick question: in the case of server returns
Fixes #5315
Before this change, there was a very narrow window in which a stream could fail to be created on a retry attempt, and subsequent stream operations could panic when this occurred.

This change is a fairly substantial restructuring of the retry code in order to guarantee that, for all clientStreams returned to the user, `cs.attempt.s` is always non-nil. New attempts that don't yet have a stream are created in `cs.newAttempt` until they pick a transport and create a stream, at which time they are promoted to `cs.attempt`.

This also allows us to run balancer pick results through the retry code, meaning UNAVAILABLE errors can be retried by the retry policy, as intended. It also handles drop results that should explicitly avoid the retry code.
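The promotion described above can be sketched as follows. This is a minimal, self-contained model of the invariant, not the actual grpc-go code: the `csAttempt`/`clientStream` names echo the diff, but the fields, the `fail` flag, and `runOpLocked` are simplifications invented for illustration:

```go
package main

import (
	"errors"
	"fmt"
)

// csAttempt is a simplified stand-in for a per-RPC attempt; s models the
// underlying stream. The invariant being modeled is that any attempt
// reachable through cs.attempt has a non-nil s.
type csAttempt struct {
	s *struct{}
}

// newStream simulates picking a transport and creating a stream; failure
// is injected via the fail flag (hypothetical, for illustration only).
func (a *csAttempt) newStream(fail bool) error {
	if fail {
		return errors.New("transport unavailable")
	}
	a.s = &struct{}{}
	return nil
}

type clientStream struct {
	attempt *csAttempt
}

// runOpLocked runs op on a fresh attempt and promotes it to cs.attempt
// only if op succeeds, so readers of cs.attempt never observe an attempt
// with a nil stream.
func (cs *clientStream) runOpLocked(op func(*csAttempt) error) error {
	newAttempt := &csAttempt{}
	if err := op(newAttempt); err != nil {
		return err // cs.attempt is left untouched
	}
	cs.attempt = newAttempt
	return nil
}

func main() {
	cs := &clientStream{}
	// A failed stream creation must not install a stream-less attempt.
	err := cs.runOpLocked(func(a *csAttempt) error { return a.newStream(true) })
	fmt.Println(err != nil, cs.attempt == nil) // true true
	// A successful creation promotes the attempt, with s non-nil.
	err = cs.runOpLocked(func(a *csAttempt) error { return a.newStream(false) })
	fmt.Println(err == nil, cs.attempt.s != nil) // true true
}
```

Keeping the not-yet-successful attempt out of `cs.attempt` is what closes the panic window: stream operations only ever see attempts that already hold a stream, and a failed creation simply surfaces its error to the retry logic.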
RELEASE NOTES: