Upgrade raft lib and fix group checksum#3085
Merged
manishrjain merged 13 commits intomasterfrom Mar 3, 2019
Merged
Conversation
golangcibot
reviewed
Mar 2, 2019
| ErrUnservedTabletMessage = "Tablet isn't being served by this instance" | ||
| errUnservedTablet = x.Errorf(ErrUnservedTabletMessage) | ||
| errUnservedTablet = x.Errorf("Tablet isn't being served by this instance.") | ||
| errPredicateMoving = x.Errorf("Predicate is being moved. Please retry later") |
There was a problem hiding this comment.
errPredicateMoving is unused (from varcheck)
…sums for groups, because map does random ordering of keys.
martinmr
approved these changes
Mar 3, 2019
Contributor
martinmr
left a comment
There was a problem hiding this comment.
Reviewed 27 of 30 files at r1, 1 of 10 files at r3, 7 of 7 files at r4.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @golangcibot)
worker/mutation.go, line 43 at r3 (raw file):
Previously, golangcibot (Bot from GolangCI) wrote…
errPredicateMovingis unused (fromvarcheck)
This error is unused. Should it be removed?
manishrjain
commented
Mar 3, 2019
Contributor
Author
manishrjain
left a comment
There was a problem hiding this comment.
Reviewable status: 34 of 35 files reviewed, 1 unresolved discussion (waiting on @golangcibot and @martinmr)
worker/mutation.go, line 43 at r3 (raw file):
Previously, martinmr (Martin Martinez Rivera) wrote…
This error is unused. Should it be removed?
Done.
manishrjain
added a commit
that referenced
this pull request
Mar 3, 2019
Found 2 issues recently: 1. If a Zero leader becomes a follower, but Dgraph already has a proposal for it, raft lib would silently drop the proposal, causing Zero to retry indefinitely. Better error handling was added in the master version of raft lib, so in this PR, I vendor that in. That fixes the issue. 2. Group checksum was being calculated by looping over map of tablets directly. This caused the checksum to be different for Zeros, causing queries to block indefinitely. This PR sorts the tablets before calculating the checksum, fixing the issue. Figured out this issue via blockade test. Also fixed up tracing, so waiting for checksum is better indicated (took me a while to realize that it was the group checksum, not the timestamp wait which was blocking). Longer version for issue 1: The master version has better error handling and reporting, so when a proposal is dropped, either because there's no leader or because there's no forwarding is not allowed, it can return an error instead of being silent. This is very useful in a bug we saw today where Dgraph Zero indefinitely tries to send a proposal, which got passed through `AmLeader()` checks, but then failed after `raft.Propose`. Because raft did not complain about the proposal, the proposal was getting retried indefinitely. Changes: * Update raft lib to use master * Fix the issue where different Zeros would end up with different checksums for groups, because map does random ordering of keys. * Remove unused var
dna2github
pushed a commit
to dna2fork/dgraph
that referenced
this pull request
Jul 19, 2019
Found 2 issues recently: 1. If a Zero leader becomes a follower, but Dgraph already has a proposal for it, raft lib would silently drop the proposal, causing Zero to retry indefinitely. Better error handling was added in the master version of raft lib, so in this PR, I vendor that in. That fixes the issue. 2. Group checksum was being calculated by looping over map of tablets directly. This caused the checksum to be different for Zeros, causing queries to block indefinitely. This PR sorts the tablets before calculating the checksum, fixing the issue. Figured out this issue via blockade test. Also fixed up tracing, so waiting for checksum is better indicated (took me a while to realize that it was the group checksum, not the timestamp wait which was blocking). Longer version for issue 1: The master version has better error handling and reporting, so when a proposal is dropped, either because there's no leader or because there's no forwarding is not allowed, it can return an error instead of being silent. This is very useful in a bug we saw today where Dgraph Zero indefinitely tries to send a proposal, which got passed through `AmLeader()` checks, but then failed after `raft.Propose`. Because raft did not complain about the proposal, the proposal was getting retried indefinitely. Changes: * Update raft lib to use master * Fix the issue where different Zeros would end up with different checksums for groups, because map does random ordering of keys. * Remove unused var
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Found 2 issues recently:
If a Zero leader becomes a follower, but Dgraph already has a proposal for it, raft lib would silently drop the proposal, causing Zero to retry indefinitely. Better error handling was added in the master version of raft lib, so in this PR, I vendor that in. That fixes the issue.
Group checksum was being calculated by looping over map of tablets directly. This caused the checksum to be different for Zeros, causing queries to block indefinitely. This PR sorts the tablets before calculating the checksum, fixing the issue. Figured out this issue via blockade test. Also fixed up tracing, so waiting for checksum is better indicated (took me a while to realize that it was the group checksum, not the timestamp wait which was blocking).
Longer version for issue 1:
The master version has better error handling and reporting, so when a proposal is dropped, either because there's no leader or because there's no forwarding is not allowed, it can return an error instead of being silent.
This is very useful in a bug we saw today where Dgraph Zero indefinitely tries to send a proposal, which got passed through
AmLeader()checks, but then failed afterraft.Propose. Because raft did not complain about the proposal, the proposal was getting retried indefinitely.This change is