Skip to content
This repository was archived by the owner on Nov 15, 2023. It is now read-only.

Conversation

@arkpar
Copy link
Member

@arkpar arkpar commented Mar 30, 2021

Currently when a block N is canonicalized, all discarded branches are removed from the canonicalization overlay immediately (unless pinned). They are also removed from the DB. The may lead to a situation, when out of two journals, for block N+1 and N+1', the former is discarded and the latter remains in the DB.

On startup the journals are loaded by trying DB keys for each block number, increasing the journal sub-index starting from 0. So if sub-index 0 is missing and sub-index 1 is there, it won't be loaded.

On kusama/polkadot this has low chance of happening, because forks of more than 1 block in length are rare and the node needs to be stopped/restarted on such a fork for the issue to happen. On rococo forks are more likely due to increased finality window and there are a lot of restart because of staling issues, which help finding this issue.

I've considered a fix that keeps track of journal counts for each block number, or involves organising journals in a linked list. Ultimately this requires updating an extra database key for each commit. So I've opted for a solution that tries a limited number of possible journals on startup. This very slightly increases startup time, but requires no extra writes when running.

@arkpar arkpar added A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C3-medium PR touches the given topic and has a medium impact on builders. D3-trivial 🧸 PR contains trivial changes in a runtime directory that do not require an audit labels Mar 30, 2021
@cheme
Copy link
Contributor

cheme commented Mar 31, 2021

The code change looks good to me.
The overhead on launch indeed doesn't seems like an issue.
I am more wondering about the new limit of 32 simultaneous fork at a given block height (is it 'D3-trivial'?).
Maybe not an issue with babe and stalled finalization (2^5), but it limits substrate possible use-cases (it may already be).

@arkpar
Copy link
Member Author

arkpar commented Mar 31, 2021

I've considered this. The limit can't be used as an attack vector, because only valid blocks are imported into the DB. And with BABE/VRF it is astronomically unlikely that 32 valid blocks may be produced at the same height. state-db overlay is not really designed to push too many blocks into it, as it keeps everything in memory. So if an alternative consensus engine expects to have 32 forks for a block height, it will probably have to be redesigned anyway. Overall I don't think there will be an issue with any real-world chain.

@cheme
Copy link
Contributor

cheme commented Mar 31, 2021

I was not thinking of babe producing 32 blocks at once (astronomical indeed), but babe producing five time 2 block over a period without finalization (finalization stalling for a quiet long time).
Edit: babe probably do some fork choice so 5 is indeed big, but those I don't really know if those choice are enforced by consensus (think not). Anyway, seems very far fetched indeed.

Copy link
Contributor

@cheme cheme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, maybe the limit to 32 fork could be written in top level documentation (or configurable).

//! Database engine uses a notion of canonicality, rather then finality. A canonical block may not be yet finalized
//! from the perspective of the consensus engine, but it still can't be reverted in the database. Most of the time
//! during normal operation last canonical block is the same as lst finalized. However if finality stall for a
//! long duration for some reason, there's only a certain number of blocks that can fit in the non-canonical overlay,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if we actually could not re-org on long non-finalized forks?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, If the fork is longer than 4096 blocks. That's why we recommend running with archive mode on validators.

index += 1;
total += 1;
},
None => break,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, so this was the bug? There was some gap and we did not loaded the block(s) after the gap?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right

Co-authored-by: Bastian Köcher <[email protected]>
@arkpar arkpar merged commit 2ab715f into master Apr 3, 2021
@arkpar arkpar deleted the a-fix-state-db branch April 3, 2021 20:49
arkpar added a commit that referenced this pull request Apr 5, 2021
* Fixed restoring state-db journals on startup

* Improved documentation a bit

* Update client/state-db/src/lib.rs

Co-authored-by: Bastian Köcher <[email protected]>

Co-authored-by: Bastian Köcher <[email protected]>
hirschenberger pushed a commit to hirschenberger/substrate that referenced this pull request Apr 14, 2021
* Fixed restoring state-db journals on startup

* Improved documentation a bit

* Update client/state-db/src/lib.rs

Co-authored-by: Bastian Köcher <[email protected]>

Co-authored-by: Bastian Köcher <[email protected]>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C3-medium PR touches the given topic and has a medium impact on builders. D3-trivial 🧸 PR contains trivial changes in a runtime directory that do not require an audit

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants