Consensus WAL may contain corrupted data #1434

@cason

Description

The first case to consider is that the WAL data is “completely” corrupted, as handled in 38f113f6c81da0af32a748718b2d87ab64e3a72f. In this case, the consensus routine hangs indefinitely and operator intervention is required.

The second case to consider is when a process crashes while writing data to the WAL. All writes to the WAL are appends, and a failed append should not corrupt the data written before it (if it does, we are in the first case). This means that, when replaying the WAL, we should tolerate the scenario where the last entry (if all writes are synchronous) or the last K entries (if writes are both synchronous and asynchronous) are corrupted. Those entries can simply be ignored and the remaining WAL entries replayed.
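A minimal sketch of this replay rule, not the actual implementation. It assumes a hypothetical entry framing (4-byte length, 4-byte CRC32, payload), and the names `encode_entry`, `replay`, and `WalCorrupted` are made up for illustration. Corrupted or truncated entries at the tail are ignored, while a valid entry found after a corrupted one is treated as the first case.

```python
import struct
import zlib

# Hypothetical framing: big-endian payload length, then CRC32 of the payload.
HEADER = struct.Struct(">II")


class WalCorrupted(Exception):
    """Corruption found before the tail of the log (the first case)."""


def encode_entry(payload: bytes) -> bytes:
    """Frame a payload as header + payload."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def replay(wal: bytes) -> list[bytes]:
    """Return the valid entries, ignoring a corrupted or truncated tail."""
    entries = []
    corrupted_seen = False
    offset = 0
    while offset < len(wal):
        if offset + HEADER.size > len(wal):
            break  # truncated header: the final append did not complete
        length, crc = HEADER.unpack_from(wal, offset)
        start = offset + HEADER.size
        payload = wal[start:start + length]
        if len(payload) < length:
            break  # truncated payload: the final append did not complete
        offset = start + length
        if zlib.crc32(payload) != crc:
            corrupted_seen = True  # candidate tail entry, skip it for now
            continue
        if corrupted_seen:
            # A valid entry after a corrupted one means the damage is not
            # confined to the tail, so replay must not silently continue.
            raise WalCorrupted("valid entry found after a corrupted one")
        entries.append(payload)
    return entries
```

Note that this sketch cannot tell a damaged length field apart from a truncated tail; a real format would need an extra safeguard, such as a checksum over the header, to make that distinction.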

While ignoring WAL entries because they are corrupted may seem odd, the protocol guarantees that an action is performed only after the associated events and inputs have been successfully persisted to the WAL. So, if the last entry of the WAL is corrupted, the write did not complete successfully, which means that the associated action was not performed. For example, the process receives a Proposal and FullValue and issues a Prevote for it. If the writing of one of the inputs, say the Proposal, fails, then the Prevote is not broadcast. Therefore, there is no equivocation if, upon restarting, the process issues a Prevote for nil, for instance.
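The persist-before-act ordering described above could be sketched as follows. This is an illustration under assumptions, not the project's code: `FileWal`, `append_sync`, `on_proposal`, and the `broadcast` callback are hypothetical names, and the Prevote is represented as a plain byte string.

```python
import os
from typing import Callable


class FileWal:
    """Hypothetical append-only WAL backed by a single file."""

    def __init__(self, path: str) -> None:
        self._file = open(path, "ab")

    def append_sync(self, entry: bytes) -> None:
        """Append an entry and return only once it is durable on disk."""
        self._file.write(entry)
        self._file.flush()
        os.fsync(self._file.fileno())


def on_proposal(wal, proposal: bytes, broadcast: Callable[[bytes], None]) -> None:
    """Persist the input first; act on it only if the write succeeded."""
    wal.append_sync(proposal)          # raises if the write fails
    broadcast(b"prevote:" + proposal)  # reached only after a successful write
```

If `append_sync` raises, `broadcast` is never called, so a process that restarts with that entry missing or corrupted has not sent a Prevote that it could later contradict.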

This issue was originally a ticket in Jira: https://circlepay.atlassian.net/browse/CCHAIN-771

Metadata

    Labels

    bug (Something isn't working), core (Related to the core consensus implementation), wal
