
Conversation

@ggawryal

Description

paritytech/substrate#8494 introduced a limit on the number of blocks with the same number that can be stored in the database. If there are more than 32 such blocks, StateDbError::TooManySiblingBlocks is raised.

This PR removes that limit, so that the state-db no longer enforces any artificial bound on that number.

While, as noted in the pull request that introduced the limit, having that many validated blocks at the same level would be very unusual, the limit itself is a hidden, unconfigurable assumption added to the substrate framework. This can be considered a risk, particularly taking into account the possible consequences of exceeding it. I think paritytech/cumulus#1559 is some evidence of that.
To my knowledge, this is the only substrate component that requires any assumption about that number, and the assumption is used only for loading the non-canonicalized journal from disk after a restart.

The changes are implemented by:

  1. Keeping the span of the noncanonical overlay level, i.e. the largest index ever used on that level (highly inspired by the alternative solution mentioned in Fixed restoring state-db journals on startup substrate#8494). This value is added to the commit, so that we know how many blocks to expect when loading the journal from the db. However, to avoid unnecessary db operations, it is only written when it is larger than the OVERLAY_LEVEL_STORE_SPANS_LONGER_THAN constant, which is set to 32 for backwards compatibility. Under normal conditions this therefore adds no extra db overhead while the chain is running.
  2. Changing the OverlayLevel to use a BTreeSet for finding the first available index, instead of the bit mask. This will be a little slower, but the db operations are probably the bottleneck anyway, so it shouldn't be much of a problem. A rough sketch of both changes is included right after this list.
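
The following is only a simplified sketch of the idea behind these two changes. The names OverlayLevel and OVERLAY_LEVEL_STORE_SPANS_LONGER_THAN come from this PR, but the fields and helper methods here are illustrative assumptions, not the actual state-db code (in particular, the span is treated as an exclusive upper bound for convenience):

use std::collections::BTreeSet;

// Backwards-compatible threshold: levels whose span never exceeds this value
// store nothing extra in the db, matching the old behavior (value assumed to
// equal the previous per-level limit of 32).
const OVERLAY_LEVEL_STORE_SPANS_LONGER_THAN: u64 = 32;

// Simplified stand-in for the noncanonical overlay level.
struct OverlayLevel {
    // Indices currently occupied by sibling blocks at this level.
    used_indices: BTreeSet<u64>,
    // One past the largest index ever used at this level ("span"); never shrinks.
    span: u64,
}

impl OverlayLevel {
    fn new() -> Self {
        Self { used_indices: BTreeSet::new(), span: 0 }
    }

    // Find the first free index by walking the ordered set instead of a bit mask.
    fn first_available_index(&self) -> u64 {
        let mut candidate = 0;
        for &used in &self.used_indices {
            if used != candidate {
                break;
            }
            candidate += 1;
        }
        candidate
    }

    // Insert a new sibling block and return the index assigned to it.
    // No upper bound is enforced anymore.
    fn insert_block(&mut self) -> u64 {
        let index = self.first_available_index();
        self.used_indices.insert(index);
        self.span = self.span.max(index + 1);
        index
    }

    // The span is only worth persisting when the default scan range used on
    // restart would otherwise miss some journal records.
    fn span_to_store(&self) -> Option<u64> {
        (self.span > OVERLAY_LEVEL_STORE_SPANS_LONGER_THAN).then_some(self.span)
    }
}

On restart, the journal loader would then scan indices 0..span for levels that stored a span, and fall back to the old fixed range otherwise (see the review thread further down).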

Tested on TestClient, which can seemingly import an arbitrary number of blocks at the same level.



@ggawryal ggawryal marked this pull request as ready for review January 15, 2024 13:22
@bkchr bkchr requested a review from arkpar January 15, 2024 14:58
@arkpar
Member

arkpar commented Jan 16, 2024

The limit was introduced to prevent parachain collators from producing blocks at the same level if the relay chain is stuck for whatever reason.

This can be considered a risk, particularly taking into account the possible consequences of exceeding it.

Having a lot of sibling blocks is not something well supported and not covered by our tests. Allowing it is also a risk.

I think paritytech/cumulus#1559 is some evidence of that.

That PR was authored not to work around the error, but to enforce correct behaviour.
The TooManySiblingBlocks error was introduced to detect this kind of pathological situation early, so I'd argue it forces the upstream code to be better behaved.

@bkchr What do you think?

@arkpar arkpar requested a review from bkchr January 16, 2024 10:30
@ggawryal
Author

Thanks for the clarification, this makes more sense to me now. I agree that for parachains, where we don't have any clear upper bound on the number of sibling blocks in the worst case, removing this limit can be risky.

I'm thinking, though, that for the relay chain, or for solo chains built with substrate, it should be fine not to have any hard-coded limit. The difference is that in this case anti-equivocation logic is implemented (not present in parachains, if I'm not mistaken), so the number of blocks at the same level is naturally bounded by the number of validators in the active era, and allowing that many blocks shouldn't, imho, be much of a problem.

If we don't want to remove this limit entirely, maybe we can move its enforcement to the parachain-related code only?

@bkchr
Member

bkchr commented Jan 20, 2024

Maybe you could start by explaining your problem, @ggawryal. Did you hit this 32-block limit, and why?

I would also like to remove this limit, but the limit exists because the current implementation probably doesn't behave that well with an unbounded number of blocks at the same height. So, this would require more involved changes to the lower layers than just removing the limit as you have done here.

@ggawryal
Author

I've been working on the Aleph Zero blockchain, where we use the AlephBFT consensus. Its main difference from GRANDPA, from the blockchain's point of view, is that finalization in AlephBFT is not strongly tied to the longest-chain rule, and it can sometimes "choose" to finalize a block on a short branch. Therefore, we've adapted the block sync mechanism to also use a request-response protocol alongside the notification protocol (actually, we're rewriting the block sync protocol to better suit our consensus, but the main idea behind the request-response protocol in substrate is roughly the same).

Because of that, and also taking into account our shorter block time (1s), I'm slightly concerned about hitting the sibling limit at some point. Currently this is only a theoretical consideration, as the largest number of sibling blocks we've had on the chain was around 6, IIRC. However, as we plan to decentralize the network further and increase the number of validators per session, the limit would become easier to reach. Of course, accidentally reaching the limit wouldn't itself be the cause of a problem, but rather one point in a sequence of failures following some event, after which bringing the blockchain back to a correct state would be quite annoying.

Ideally, we'd like to remove that limit entirely for the Aleph Zero blockchain, or increase it to at least the number of validators per session. Workarounds, like pruning some leaves from the db, are not an option, as most nodes can't reliably predict which blocks are least likely to be finalized. Even if we could predict that somehow, any block sync mechanism using substrate's block request protocol would be problematic in such a case: a single malicious party could use it to request a block we've dropped. This problem is easy to miss, as the sibling limit is documented only in the state-db documentation.

@bkchr
Member

bkchr commented Feb 27, 2024

@ggawryal could you please either do some experiments with your PR and a high number of forks at the same height, or make the parameter configurable?

@paritytech-cicd-pr

The CI pipeline was cancelled due to the failure of one of the required jobs.
Job name: cargo-clippy
Logs: https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/5428258

@ggawryal
Author

ggawryal commented Mar 4, 2024

Added some basic experiments: importing a bunch of forks seems to be about as fast on TestClient as importing a chain with the same number of blocks on top of each other, at least on my machine (see the tests in service/test/src/client).
As I wasn't sure how to prevent parachain collators from producing too many blocks, I also added a flag to the config that enables/disables this limit, as you've suggested.

@ggawryal
Author

ggawryal commented Apr 8, 2024

@bkchr @arkpar Could you please suggest anything else that could be done with this PR, or maybe review it assuming the block limit is enabled by default?

@ggawryal
Author

@bkchr @arkpar gentle ping

Member

@bkchr bkchr left a comment


Yeah, we can go ahead with this one. However, I would not make it configurable, so please revert the changes in config.rs etc.

@ggawryal ggawryal requested a review from bkchr June 26, 2024 07:56
@ggawryal
Author

ggawryal commented Jul 8, 2024

I've reverted the changes; you can take a look now.

@ggawryal
Author

@bkchr @arkpar any update please on this PR?

@ggawryal
Author

ggawryal commented Oct 7, 2024

@bkchr @arkpar gentle ping

for index in 0..MAX_BLOCKS_PER_LEVEL {
let level_span = db
.get_meta(&to_meta_key(OVERLAY_LEVEL_SPAN, &block))
.map_err(Error::Db)?
Member


Wouldn't that break compatibility with existing databases that don't have the OVERLAY_LEVEL_SPAN key?

Author

@ggawryal ggawryal Oct 16, 2024


I think MetaDb::get_meta returns Ok(None) for a missing key, right? If so, this will default to iterating through indices 0..OVERLAY_LEVEL_STORE_SPANS_LONGER_THAN, which is the same as the old behavior.
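
For illustration only, the fallback described here might look roughly like the following; decode_span is a hypothetical decoding helper, and the other names are taken from the snippet above rather than the exact PR code:

// If no span was stored for this level, scan the same fixed range as the old code.
let level_span = db
    .get_meta(&to_meta_key(OVERLAY_LEVEL_SPAN, &block))
    .map_err(Error::Db)?
    .map(|bytes| decode_span(&bytes)) // hypothetical decoding helper
    .unwrap_or(OVERLAY_LEVEL_STORE_SPANS_LONGER_THAN);

for index in 0..level_span {
    // look up the journal record for (block_number, index), as before
}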

@ggawryal
Author

@bkchr can you please add another review?

@ggawryal
Author

@bkchr gentle ping

@ggawryal
Author

@bkchr gentle ping

@ggawryal
Author

ggawryal commented Feb 5, 2025

@bkchr could you, or maybe someone else, review this please?

@ggawryal ggawryal closed this Jun 30, 2025
