Describe the bug
Several orchestrators have reported issues with the latest release (0.5.35). After I upgraded on Monday, stream stability degraded noticeably: large spikes and crashes were common, and the logs showed many “connection reset by peer” errors. After extensive troubleshooting, I rolled back to 0.5.34 today, which restored stability; we performed better than on any other day this week.
I think PR #2628 might be related to these connections dropping prematurely.
To Reproduce
Steps to reproduce the behavior:
While operating an orchestrator in production on 0.5.35, watch the logs for connection resets and monitor per-stream metrics. Many of the dropped streams are associated with “connection reset by peer” errors.
Expected behavior
After some research, it seems an idle timeout may not be the best way to handle latent connections: https://stackoverflow.com/questions/29334407/creating-an-idle-timeout-in-go
At first I thought 8 seconds was simply too low, but it seems that separate read and write timeouts may perform better than a single idle timeout. There is also an event-handler option; see the Stack Overflow link above.
Screenshots
None
Desktop (please complete the following information):
Additional context