Remove block limit per level #2933
Conversation
The limit was introduced to prevent parachain collators from producing blocks at the same level if the relay chain is stuck for whatever reason.
Having a lot of sibling blocks is not something well supported, and it is not covered by our tests. Allowing it is also a risk.
This PR was authored not to work around the error, but to enforce correct behaviour. @bkchr What do you think?
Thanks for the clarification, this makes more sense to me now. I agree that for parachains, where we don't have any clear upper bound on the number of sibling blocks in the worst case, removing this can be risky. I'm thinking, though, that for the relay chain, or solo chains developed using Substrate, it should be fine not to have any hard-coded limit. The difference is that in this case we have anti-equivocation logic implemented (not present in parachains, if I'm not mistaken), so the number of blocks at the same level is naturally bounded by the number of validators in the active era, and allowing that many blocks IMHO shouldn't be much of a problem. If we don't want to remove this limit, maybe we can enforce it only in the parachain-related codebase?
Maybe you could start by explaining your problem @ggawryal. Did you hit this 32-block limit, and why? I mean, I would also like to remove this limit, but I know the limit exists because the current implementation probably does not behave that well when there is an unbounded number of blocks at the same height. So, this would require more involved changes to the lower layers than just removing the limit as you have done here.
I've been working on the Aleph Zero blockchain, and we use the AlephBFT consensus there. Its main difference from GRANDPA, from the blockchain's point of view, is that finalization in AlephBFT is not strongly tied to the longest-chain rule, and it can sometimes "choose" to finalize a block on some short branch. Therefore, we've adapted the block sync mechanism to also use a request-response protocol in support of the notification protocol (actually, we're rewriting the block sync protocol to suit our consensus better, but the main idea with the request-response protocol in Substrate is roughly the same). Because of that, and also taking into account the shorter block time (1s), I'm slightly concerned about hitting the sibling limit at some point. Currently, this is only a theoretical consideration, as the largest number of sibling blocks we've had on the chain was around 6, IIRC. However, as we plan to decentralize the network further and increase the number of validators per session, this limit would become easier to reach. Of course, accidentally reaching the limit itself wouldn't be the cause of a problem, but rather one point in a sequence of failures following some event, after which bringing the blockchain back to a correct state would be quite annoying. Ideally, we'd like to remove that limit entirely for the Aleph Zero blockchain, or increase it to at least the number of validators per session. Workarounds, like pruning some leaves from the db, are not an option, as most nodes can't predict well which blocks are the least likely to be finalized. Also, even if we were able to predict it somehow, any block sync mechanism using the block request protocol from Substrate would be problematic in such a case: a single malicious party could use it to request any block we've dropped. This problem can be easily missed, as the sibling limit is documented only in the documentation of
@ggawryal could you please either do some experiments with your PR and a high number of forks at the same height, or make the parameter configurable?
The CI pipeline was cancelled due to the failure of one of the required jobs.
Added some basic experiments; importing a bunch of forks seems to happen as fast on
bkchr left a comment
Yeah, we can go ahead with this one. However, I would not make it configurable. So, please revert the changes in config.rs etc.

I've reverted the changes; you can take a look now.
```rust
for index in 0..MAX_BLOCKS_PER_LEVEL {
	let level_span = db
		.get_meta(&to_meta_key(OVERLAY_LEVEL_SPAN, &block))
		.map_err(Error::Db)?
```
Wouldn't that break compatibility with existing databases that don't have the OVERLAY_LEVEL_SPAN key?
I think MetaDb::get_meta returns Ok(None) in case of a missing key, right? If so, this will default to iterating through indices 0..OVERLAY_LEVEL_STORE_SPANS_LONGER_THAN, which is the same as the old behavior.
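That fallback can be sketched as a minimal standalone helper (the function name `effective_span` and the plain `Option<u64>` shape are illustrative assumptions, not the actual state-db code):

```rust
// Illustrative sketch, not the actual state-db code: a missing meta key
// (MetaDb::get_meta returning Ok(None)) falls back to the legacy span,
// so databases written before this change behave exactly as before.
const OVERLAY_LEVEL_STORE_SPANS_LONGER_THAN: u64 = 32;

fn effective_span(stored: Option<u64>) -> u64 {
    stored.unwrap_or(OVERLAY_LEVEL_STORE_SPANS_LONGER_THAN)
}

fn main() {
    // Old database without the key: iterate the legacy 0..32 range.
    assert_eq!(effective_span(None), 32);
    // New database that recorded a larger span: use the stored value.
    assert_eq!(effective_span(Some(100)), 100);
}
```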
@bkchr can you please add another review?
@bkchr gentle ping
@bkchr gentle ping
@bkchr could you or maybe someone else review, please?
Description
In paritytech/substrate#8494, a limit was introduced on the number of blocks with the same number that can be stored in the database. If there are more than 32 such blocks, a StateDbError::TooManySiblingBlocks error is raised. This PR removes that limit by making state-db not enforce any artificial bound on that number.

While, as noted in the pull request that introduced the limit, having that many validated blocks at the same level would be something very unusual, the limit itself is a hidden, unconfigurable assumption added to the Substrate framework. This can be considered a kind of risk, particularly taking into account the possible consequences of exceeding it. I think the fixed issue paritytech/cumulus#1559 is some evidence of this.

To my knowledge, this is the only Substrate component that requires any assumption on that number, and moreover it is used only for loading the non-canonicalized journal from disk after a restart.
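For context, the removed behaviour can be sketched roughly as follows (all names here are illustrative stand-ins, not the exact state-db code; only the constant's value of 32 and the error variant name come from the PR):

```rust
// Rough sketch of the limit this PR removes: adding a 33rd sibling
// block at one level fails with TooManySiblingBlocks.
const MAX_BLOCKS_PER_LEVEL: usize = 32;

#[derive(Debug, PartialEq)]
enum StateDbError {
    TooManySiblingBlocks,
}

// Hypothetical check: `blocks_at_level` is how many siblings the level
// already holds before inserting one more.
fn check_sibling_count(blocks_at_level: usize) -> Result<(), StateDbError> {
    if blocks_at_level >= MAX_BLOCKS_PER_LEVEL {
        return Err(StateDbError::TooManySiblingBlocks);
    }
    Ok(())
}

fn main() {
    assert!(check_sibling_count(31).is_ok());
    assert_eq!(
        check_sibling_count(32),
        Err(StateDbError::TooManySiblingBlocks)
    );
}
```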
The changes are implemented by:
- Introducing the span of the noncanonical overlay level, which is the largest index ever used on that level (highly inspired by the alternative solution mentioned in Fixed restoring state-db journals on startup substrate#8494). Later, this value is added to the commit, so that we know how many blocks to expect when loading the journal from the db. However, to avoid unnecessary db operations, it is only written when it is larger than the OVERLAY_LEVEL_STORE_SPANS_LONGER_THAN constant, set to 32 for backwards compatibility. That being said, under normal conditions this adds no extra overhead in terms of db operations while the chain is running.
- Changing OverlayLevel to use a BTreeSet for searching for the first available index instead of a bit mask. This will be a little slower, but the db operations are probably the bottleneck anyway, so it shouldn't be much of a problem.

Tested on TestClient, which seemingly can import an arbitrary number of blocks on the same level.
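The first-available-index search over a BTreeSet can be sketched like this (a hypothetical standalone helper; the real OverlayLevel code may differ in names and details):

```rust
use std::collections::BTreeSet;

// Sketch: find the smallest index not yet occupied on a level. A BTreeSet
// iterates in ascending order, so the first gap in the sequence is the
// answer; if there is no gap, the result is one past the largest index.
fn first_available_index(used: &BTreeSet<u64>) -> u64 {
    let mut candidate = 0;
    for &index in used {
        if index != candidate {
            break;
        }
        candidate += 1;
    }
    candidate
}

fn main() {
    let used: BTreeSet<u64> = [0, 1, 3].into_iter().collect();
    assert_eq!(first_available_index(&used), 2); // gap at index 2

    let full: BTreeSet<u64> = (0..5).collect();
    assert_eq!(first_available_index(&full), 5); // no gap: next index

    assert_eq!(first_available_index(&BTreeSet::new()), 0); // empty level
}
```

Unlike a fixed-width bit mask, this structure imposes no upper bound on the index, which is what lets the hard 32-block limit go away.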