
Conversation

@ggawryal

Description

paritytech/substrate#8494 introduced a limit on the number of blocks with the same number that can be stored in the database. If there are more than 32 such blocks, StateDbError::TooManySiblingBlocks is raised.

This PR removes that limit, so that the state-db no longer enforces any artificial bound on that number.

While, as noted in the pull request that introduced the limit, having that many validated blocks at the same level would be very unusual, the limit itself is a hidden, unconfigurable assumption added to the substrate framework. This can be considered a risk, particularly taking into account the possible consequences of exceeding it. I think paritytech/cumulus#1559 is some evidence of that.
To my knowledge, this is the only substrate component that requires any assumption about that number, and the assumption is used only for loading the non-canonicalized journal from disk after a restart.

The changes are implemented by:

  1. Keeping the span of the noncanonical overlay level, i.e. the largest index ever used on that level (highly inspired by the alternative solution mentioned in Fixed restoring state-db journals on startup substrate#8494). This value is added to the commit, so that we know how many blocks to expect when loading the journal from the db. However, to avoid unnecessary db operations, it is only written when it is larger than the OVERLAY_LEVEL_STORE_SPANS_LONGER_THAN constant, which is set to 32 for backwards compatibility. Under normal conditions this therefore adds no extra db overhead while the chain is running.
  2. Changing the OverlayLevel to use a BTreeSet for finding the first available index, instead of the bit mask. This will be a little slower, but the db operations are probably the bottleneck anyway, so it shouldn't be much of a problem. A rough sketch of both changes is included right after this list.
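
The following is only a simplified sketch of the idea behind these two changes. The names OverlayLevel and OVERLAY_LEVEL_STORE_SPANS_LONGER_THAN come from this PR, but the fields and helper methods here are illustrative assumptions, not the actual state-db code (in particular, the span is treated as an exclusive upper bound for convenience):

use std::collections::BTreeSet;

// Backwards-compatible threshold: levels whose span never exceeds this value
// store nothing extra in the db, matching the old behavior (value assumed to
// equal the previous per-level limit of 32).
const OVERLAY_LEVEL_STORE_SPANS_LONGER_THAN: u64 = 32;

// Simplified stand-in for the noncanonical overlay level.
struct OverlayLevel {
    // Indices currently occupied by sibling blocks at this level.
    used_indices: BTreeSet<u64>,
    // One past the largest index ever used at this level ("span"); never shrinks.
    span: u64,
}

impl OverlayLevel {
    fn new() -> Self {
        Self { used_indices: BTreeSet::new(), span: 0 }
    }

    // Find the first free index by walking the ordered set instead of a bit mask.
    fn first_available_index(&self) -> u64 {
        let mut candidate = 0;
        for &used in &self.used_indices {
            if used != candidate {
                break;
            }
            candidate += 1;
        }
        candidate
    }

    // Insert a new sibling block and return the index assigned to it.
    // No upper bound is enforced anymore.
    fn insert_block(&mut self) -> u64 {
        let index = self.first_available_index();
        self.used_indices.insert(index);
        self.span = self.span.max(index + 1);
        index
    }

    // The span is only worth persisting when the default scan range used on
    // restart would otherwise miss some journal records.
    fn span_to_store(&self) -> Option<u64> {
        (self.span > OVERLAY_LEVEL_STORE_SPANS_LONGER_THAN).then_some(self.span)
    }
}

On restart, the journal loader would then scan indices 0..span for levels that stored a span, and fall back to the old fixed range otherwise (see the review thread further down).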

Tested on TestClient, which can seemingly import an arbitrary number of blocks at the same level.



@ggawryal ggawryal marked this pull request as ready for review January 15, 2024 13:22
@bkchr bkchr requested a review from arkpar January 15, 2024 14:58
@arkpar
Member

arkpar commented Jan 16, 2024

The limit was introduced to prevent parachain collators from producing blocks at the same level if the relay chain is stuck for whatever reason.

This can be considered a risk, particularly taking into account the possible consequences of exceeding it.

Having a lot of sibling blocks is not something well supported and not covered by our tests. Allowing it is also a risk.

I think paritytech/cumulus#1559 is some evidence of that.

That PR was authored not to work around the error, but to enforce correct behaviour.
The TooManySiblingBlocks error was introduced to detect this kind of pathological situation early, so I'd argue it forces the upstream code to be better behaved.

@bkchr What do you think?

@arkpar arkpar requested a review from bkchr January 16, 2024 10:30
@ggawryal
Author

Thanks for the clarification, this makes more sense to me now. I agree that for parachains, where we don't have any clear upper bound on the number of sibling blocks in the worst case, removing this limit can be risky.

I'm thinking, though, that for the relay chain, or for solo chains built with substrate, it should be fine not to have any hard-coded limit. The difference is that in this case anti-equivocation logic is implemented (not present in parachains, if I'm not mistaken), so the number of blocks at the same level is naturally bounded by the number of validators in the active era, and allowing that many blocks shouldn't, imho, be much of a problem.

If we don't want to remove this limit entirely, maybe we can move its enforcement to the parachain-related code only?

@bkchr
Member

bkchr commented Jan 20, 2024

Maybe you could start by explaining your problem, @ggawryal. Did you hit this 32-block limit, and why?

I would also like to remove this limit, but the limit exists because the current implementation probably doesn't behave that well with an unbounded number of blocks at the same height. So, this would require more involved changes to the lower layers than just removing the limit as you have done here.

@ggawryal
Author

I've been working on the Aleph Zero blockchain, where we use the AlephBFT consensus. Its main difference from GRANDPA, from the blockchain's point of view, is that finalization in AlephBFT is not strongly tied to the longest-chain rule, and it can sometimes "choose" to finalize a block on a short branch. Therefore, we've adapted the block sync mechanism to also use a request-response protocol alongside the notification protocol (actually, we're rewriting the block sync protocol to better suit our consensus, but the main idea behind the request-response protocol in substrate is roughly the same).

Because of that, and also taking into account our shorter block time (1s), I'm slightly concerned about hitting the sibling limit at some point. Currently this is only a theoretical consideration, as the largest number of sibling blocks we've had on the chain was around 6, IIRC. However, as we plan to decentralize the network further and increase the number of validators per session, the limit would become easier to reach. Of course, accidentally reaching the limit wouldn't itself be the cause of a problem, but rather one point in a sequence of failures following some event, after which bringing the blockchain back to a correct state would be quite annoying.

Ideally, we'd like to remove that limit entirely for the Aleph Zero blockchain, or increase it to at least the number of validators per session. Workarounds, like pruning some leaves from the db, are not an option, as most nodes can't reliably predict which blocks are least likely to be finalized. Even if we could predict that somehow, any block sync mechanism using substrate's block request protocol would be problematic in such a case: a single malicious party could use it to request a block we've dropped. This problem is easy to miss, as the sibling limit is documented only in the state-db documentation.

@bkchr
Member

bkchr commented Feb 27, 2024

@ggawryal could you please either do some experiments with your PR and a high number of forks at the same height, or make the parameter configurable?

@paritytech-cicd-pr

The CI pipeline was cancelled due to the failure of one of the required jobs.
Job name: cargo-clippy
Logs: https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/5428258

@ggawryal
Author

ggawryal commented Mar 4, 2024

Added some basic experiments: importing a bunch of forks seems to be about as fast on TestClient as importing a chain with the same number of blocks on top of each other, at least on my machine (see the tests in service/test/src/client).
As I wasn't sure how to prevent parachain collators from producing too many blocks, I also added a flag to the config that enables/disables this limit, as you've suggested.

@ggawryal
Author

ggawryal commented Apr 8, 2024

@bkchr @arkpar Could you please suggest anything else that could be done with this PR, or maybe review it assuming the block limit is enabled by default?

@ggawryal
Author

@bkchr @arkpar gentle ping

Member

@bkchr bkchr left a comment


Yeah, we can go ahead with this one. However, I would not make it configurable, so please revert the changes in config.rs etc.

@ggawryal ggawryal requested a review from bkchr June 26, 2024 07:56
@ggawryal
Author

ggawryal commented Jul 8, 2024

I've reverted the changes; you can take a look now.

@ggawryal
Author

@bkchr @arkpar any update please on this PR?

@ggawryal
Author

ggawryal commented Oct 7, 2024

@bkchr @arkpar gentle ping

for index in 0..MAX_BLOCKS_PER_LEVEL {
let level_span = db
.get_meta(&to_meta_key(OVERLAY_LEVEL_SPAN, &block))
.map_err(Error::Db)?
Member


Wouldn't that break compatibility with existing databases that don't have the OVERLAY_LEVEL_SPAN key?

Author

@ggawryal ggawryal Oct 16, 2024


I think MetaDb::get_meta returns Ok(None) for a missing key, right? If so, this will default to iterating through indices 0..OVERLAY_LEVEL_STORE_SPANS_LONGER_THAN, which is the same as the old behavior.
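
For illustration only, the fallback described here might look roughly like the following; decode_span is a hypothetical decoding helper, and the other names are taken from the snippet above rather than the exact PR code:

// If no span was stored for this level, scan the same fixed range as the old code.
let level_span = db
    .get_meta(&to_meta_key(OVERLAY_LEVEL_SPAN, &block))
    .map_err(Error::Db)?
    .map(|bytes| decode_span(&bytes)) // hypothetical decoding helper
    .unwrap_or(OVERLAY_LEVEL_STORE_SPANS_LONGER_THAN);

for index in 0..level_span {
    // look up the journal record for (block_number, index), as before
}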

@ggawryal
Author

@bkchr can you please add another review?

@ggawryal
Author

@bkchr gentle ping

@ggawryal
Author

@bkchr gentle ping

@ggawryal
Author

ggawryal commented Feb 5, 2025

@bkchr could you, or maybe someone else, review this please?

@ggawryal ggawryal closed this Jun 30, 2025
