[RECEIVE] Make out of bounds a non-recoverable error #5131
wiardvanrij wants to merge 3 commits into thanos-io:main
Conversation
Signed-off-by: Wiard van Rij <wvanrij@roku.com>
I appreciate the time you took to find this case!
case errOutOfBounds:
	http.Error(w, err.Error(), http.StatusBadRequest)
Maybe I'm missing something here, but my understanding was that Prometheus treats as recoverable only network errors, 5xx responses and, if configured to, 429 responses (see this line and below).
It therefore should not make a difference whether we respond with Bad Request (400) or Conflict (409), since conflict is already unrecoverable, right?
For Prometheus, 400 = unrecoverable, if that's your question? Also, if that's untrue, then I'm happy to be pointed to some resource. However, I've tested this, and the response now becomes unrecoverable for Prometheus.
As far as I understood, only 5xx codes are recoverable for Prometheus (see the link in my comment). Meaning if until now we were returning 409 for out of bounds error, it should have already been working as expected, i.e. there should not be a real difference between the codes. As long as we return any 4xx code, it should be treated as unrecoverable (with the possible exception of 429).
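For reference, the decision described above can be sketched roughly like this. This is a simplified illustration, not Prometheus' actual code; the function name `retryable` and the `retryOn429` parameter (standing in for the `retry_on_http_429` remote write setting) are made up for this sketch:

```go
package main

import "fmt"

// retryable sketches how the remote-write client classifies a response
// status: 5xx is recoverable (retried), 429 is recoverable only when the
// retry-on-429 option is enabled, and every other 4xx is non-recoverable
// (the samples are dropped).
func retryable(status int, retryOn429 bool) bool {
	if status/100 == 5 {
		return true
	}
	if status == 429 {
		return retryOn429
	}
	return false
}

func main() {
	fmt.Println(retryable(409, false)) // false – 409 is already non-recoverable
	fmt.Println(retryable(400, false)) // false – 400 behaves the same way
	fmt.Println(retryable(503, false)) // true  – 5xx triggers a retry
}
```

Under this logic, switching out-of-bounds from 409 to 400 should not change Prometheus' retry behavior, which is what the test below also suggests.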
Right, I understand you and the code (missed the link) now. The funny thing is, though, that I did not experience this behaviour. Only after my change do I actually see the non-recoverable errors in the Prometheus logs. 🤔
More on this: prometheus/prometheus#5649 (comment)
looking forward to seeing this fixed. it's causing us pain in production.
@wiardvanrij I did a small test today: I spun up a VM where I skewed the clock on purpose (made it run in the 'past' compared to my host machine), brought up an instance of Prometheus, and configured it to remote write to a Thanos instance running on my host.
I first ran it against Thanos built from latest main. The result: Prometheus informs me this is a non-recoverable error:
ts=2022-02-12T12:26:39.089Z caller=dedupe.go:112 component=remote level=error remote_name=6c3063 url=http://192.168.56.1:19291/api/v1/receive msg="non-recoverable error" count=9 exemplarCount=0 err="server returned HTTP status 409 Conflict: store locally for endpoint 127.0.0.1:10901: conflict"
Built from your branch:
ts=2022-02-12T12:34:32.266Z caller=dedupe.go:112 component=remote level=error remote_name=6c3063 url=http://192.168.56.1:19291/api/v1/receive msg="non-recoverable error" count=63 exemplarCount=0 err="server returned HTTP status 400 Bad Request: store locally for endpoint 127.0.0.1:10901: out of bounds"
I haven't noticed any unexpected behavior on either side (Thanos or Prometheus). Prometheus treated it (in both cases) simply as unrecoverable error.
Thanks @matej-g, I really appreciate the testing. Then I need more user input. @Kampe, did you have any chance to test it, or could you provide more info? Because if we now look at the code, it should already be correct. I would agree that we sometimes see odd behaviour, but then we really need to know why that is happening.
If I have some more time, I'll try to figure things out as well.
I responded in the issue ticket #4831, but there definitely seem to be two issues happening, with 500-level errors and with 409s. Unsure if they are related, but it seems we can reproduce them on at least a monthly cadence (sometimes weekly).
We have not had the chance to roll the changes to any of our environments. We don't have much tooling set up to roll non-registry-hosted third-party tools even to developer environments, so testing without a prebuilt container is pretty intensive for us. However, when we have a container we can reference, I can get it rolled out in a matter of minutes.
@wiardvanrij do you still need some testing for this feature? I think I experience this issue with a specific tenant which causes the receiver to consume a lot more memory than usual during certain scenarios, to the point where it will eventually OOM kill. Looking at the logs, it seems like the issue is consistent with the description, as the Prometheus agents will continue to retry the same failed out-of-order requests over and over again.
Just wanted to add that I migrated to the new router->ingestor model and it solved most of my issues. I am still seeing a few out of bounds errors, but they are nowhere near where they were before.
Sorry for the lag.
Can you merge this, please? The message "unsupported value type" is very confusing, and it took us a lot of time to figure out what's going on (and we still have a lot of 409s). Our solution: send fewer metrics...
No, this isn't the solution. Please follow up on the issue with input.
Signed-off-by: Wiard van Rij wvanrij@roku.com
Relates to: #4831
Prometheus' remote write receiver treats `ErrOutOfOrder` as non-recoverable: prometheus/prometheus#5649 (comment)
Changes
Out of bounds would result in an HTTP 409, which causes Prometheus to retry, eventually causing an infinite loop that can only be solved by flushing the entire Prometheus data dir.
This change moves the OutOfBounds error out of ConflictErr into its own entity, which allows Prometheus to treat it as non-recoverable (i.e. dropping the samples instead of retrying).
Also made a small adjustment in the Debug.Log, as it could not print the sample, which caused a confusing debug message like sample="unsupported value type". Now it prints the sample value and timestamp.
Verification
I mean, I'll be honest that it was a pain in the a. to test and debug this. What I have checked is that Prometheus now handles this as non-recoverable and nothing builds up anymore as pending.
I have adjusted the test cases and added OutOfBounds to them.