[RECEIVE] Make out of bounds a non-recoverable error #5131
wiardvanrij wants to merge 3 commits into thanos-io:main
Conversation
Signed-off-by: Wiard van Rij <wvanrij@roku.com>
I appreciate the time you took to find this case!
case errOutOfBounds:
	http.Error(w, err.Error(), http.StatusBadRequest)
Maybe I'm missing something here, but my understanding was that Prometheus treats as recoverable only network errors, 5xx responses and, if configured to, 429 responses (see this line and below).
It therefore should not make a difference whether we respond with Bad Request (400) or Conflict (409), since conflict is already unrecoverable, right?
For Prometheus, 400 = unrecoverable, if that's your question? Also, if that's untrue, then I'm happy to be pointed to some resource. However, I've tested this, and the response now becomes unrecoverable for Prometheus.
As far as I understood, only 5xx codes are recoverable for Prometheus (see the link in my comment). Meaning if until now we were returning 409 for out of bounds error, it should have already been working as expected, i.e. there should not be a real difference between the codes. As long as we return any 4xx code, it should be treated as unrecoverable (with the possible exception of 429).
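For reference, the decision described above can be sketched roughly like this. This is a simplified illustration, not Prometheus' actual code; the function name `retryable` and the `retryOn429` parameter (standing in for the `retry_on_http_429` remote write setting) are made up for this sketch:

```go
package main

import "fmt"

// retryable sketches how the remote-write client classifies a response
// status: 5xx is recoverable (retried), 429 is recoverable only when the
// retry-on-429 option is enabled, and every other 4xx is non-recoverable
// (the samples are dropped).
func retryable(status int, retryOn429 bool) bool {
	if status/100 == 5 {
		return true
	}
	if status == 429 {
		return retryOn429
	}
	return false
}

func main() {
	fmt.Println(retryable(409, false)) // false – 409 is already non-recoverable
	fmt.Println(retryable(400, false)) // false – 400 behaves the same way
	fmt.Println(retryable(503, false)) // true  – 5xx triggers a retry
}
```

Under this logic, switching out-of-bounds from 409 to 400 should not change Prometheus' retry behavior, which is what the test below also suggests.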
Right, I understand you and the code (missed the link) now. The funny thing is, though, that I did not experience this behaviour. Only after my change do I actually see the non-recoverable errors in the Prometheus logs. 🤔
More on this: prometheus/prometheus#5649 (comment)
looking forward to seeing this fixed. it's causing us pain in production.
@wiardvanrij I did a small test today: I spun up a VM where I skewed the clock on purpose (made it run in the 'past' compared to my host machine), brought up an instance of Prometheus, and configured it to remote write to a Thanos instance running on my host.
I first ran it against Thanos built from latest main. The result: Prometheus informs me this is a non-recoverable error:
ts=2022-02-12T12:26:39.089Z caller=dedupe.go:112 component=remote level=error remote_name=6c3063 url=http://192.168.56.1:19291/api/v1/receive msg="non-recoverable error" count=9 exemplarCount=0 err="server returned HTTP status 409 Conflict: store locally for endpoint 127.0.0.1:10901: conflict"
Built from your branch:
ts=2022-02-12T12:34:32.266Z caller=dedupe.go:112 component=remote level=error remote_name=6c3063 url=http://192.168.56.1:19291/api/v1/receive msg="non-recoverable error" count=63 exemplarCount=0 err="server returned HTTP status 400 Bad Request: store locally for endpoint 127.0.0.1:10901: out of bounds"
I haven't noticed any unexpected behavior on either side (Thanos or Prometheus). Prometheus treated it (in both cases) simply as unrecoverable error.
Thanks @matej-g, I really appreciate the testing. Then I need more user input. @Kampe, did you have any chance to test it, or could you provide more info? Because if we now look at the code, it should already be correct. I would agree that we sometimes see odd behaviour, but then we really need to know why that is happening.
If I have some more time, I'll try to figure things out as well.
I responded in the issue ticket #4831, but there definitely seem to be two issues happening, with 500-level errors and with 409s. Unsure if they are related, but it seems we can reproduce them on at least a monthly cadence (sometimes weekly).
We have not had the chance to roll the changes to any of our environments. We don't have much tooling set up to roll non-registry-hosted third-party tools even to developer environments, so testing without a prebuilt container is pretty intensive for us. However, when we have a container we can reference, I can get it rolled out in a matter of minutes.
@wiardvanrij do you still need some testing for this feature? I think I experience this issue with a specific tenant which causes the receiver to consume a lot more memory than usual during certain scenarios, to the point where it will eventually OOM kill. Looking at the logs, it seems like the issue is consistent with the description, as the Prometheus agents will continue to retry the same failed out-of-order requests over and over again.
Just wanted to add that I migrated to the new router->ingestor model and it solved most of my issues. I am still seeing a few out of bounds errors, but they are nowhere near where they were before.
Sorry for the lag.
Can you merge this, please? The message "unsupported value type" is very confusing, and it took us a lot of time to figure out what's going on (and we still have a lot of 409s). Our solution: send fewer metrics...
No, this isn't the solution. Please follow up on the issue with input.
Signed-off-by: Wiard van Rij wvanrij@roku.com
Relates to: #4831
Prometheus' remote write receiver treats `ErrOutOfOrder` as non-recoverable: prometheus/prometheus#5649 (comment)
Changes
Out of bounds would result in an HTTP 409, which causes Prometheus to retry, eventually causing an infinite loop that can only be solved by flushing the entire Prometheus data dir.
This change moves the OutOfBounds error out of ConflictErr into its own entity, which allows Prometheus to treat it as non-recoverable (i.e. dropping the samples instead of retrying).
Also made a small adjustment in the Debug.Log, as it could not print the sample, which caused a confusing debug message like sample="unsupported value type". Now it prints the sample value and timestamp.
Verification
I mean, I'll be honest that it was a pain in the a. to test and debug this. What I have checked is that Prometheus now handles this as non-recoverable and nothing builds up anymore as pending.
I have adjusted the test cases and added OutOfBounds to them.