Consensus WAL may contain corrupted data #1434
Description
The first case to consider is that the WAL data is “completely” corrupted, as handled in 38f113f6c81da0af32a748718b2d87ab64e3a72f. In this case, the consensus routine will hang indefinitely and operator intervention is required.
The second case to consider is when a process crashes while writing data to the WAL. All writes to the WAL are appends, and a failed append should not corrupt the data that precedes it (if it does, we are in the first case). This means that, when replaying the WAL, we should tolerate the scenario where the last entry (if all writes are synchronous) or the last K entries (if writes are both synchronous and asynchronous) are corrupted. Those entries can simply be ignored and the rest of the WAL replayed.
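To make the second case concrete, here is a minimal sketch of a replay that tolerates a corrupted tail. It does not reflect the actual WAL format or API: the framing (4-byte length plus CRC32 per entry), the names `encode_entry`, `replay` and `WalCorrupted`, and the `max_discard` bound used to tell a failed tail append apart from the first case are all assumptions made for illustration.

```python
import struct
import zlib

# Hypothetical entry framing: big-endian payload length, then CRC32 of the payload.
HEADER = struct.Struct(">II")


class WalCorrupted(Exception):
    """Corruption is too extensive to be explained by a failed tail append."""


def encode_entry(payload: bytes) -> bytes:
    """Frame a payload as a single WAL entry (assumed format)."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def replay(data: bytes, max_discard: int) -> tuple[list[bytes], int]:
    """Return the valid prefix of entries and the offset where it ends.

    Replay stops at the first truncated or checksum-failing entry. If more
    than `max_discard` bytes would be dropped, the damage is treated as the
    "completely corrupted" case and WalCorrupted is raised instead.
    """
    entries = []
    offset = 0
    while offset < len(data):
        if offset + HEADER.size > len(data):
            break  # truncated header: the append did not complete
        length, crc = HEADER.unpack_from(data, offset)
        start = offset + HEADER.size
        payload = data[start:start + length]
        if len(payload) != length or zlib.crc32(payload) != crc:
            break  # truncated or corrupted payload
        entries.append(payload)
        offset = start + length
    discarded = len(data) - offset
    if discarded > max_discard:
        raise WalCorrupted(f"{discarded} bytes unreadable after offset {offset}")
    return entries, offset
```

The returned offset could be used to truncate the file before new entries are appended. With asynchronous writes, `max_discard` would need to cover up to K entries rather than one.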
While it may seem odd to ignore some WAL entries because they are corrupted, the operation of the protocol guarantees that an action is performed only after the associated events and inputs have been successfully persisted to the WAL. So, if the last entry of the WAL is corrupted, the write did not complete successfully, which means that the associated action was not performed. For example, the process receives a Proposal and FullValue and issues a Prevote for it. If the writing of the inputs, say the Proposal, fails, then the Prevote is not broadcast. Therefore, there is no equivocation if, after restarting, the process issues a Prevote for nil, for instance.
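The persist-before-act ordering that this argument relies on can be sketched as follows. This is an illustration only, not the project's code: `handle_input`, `wal_append` and `broadcast` are hypothetical names, and `wal_append` is assumed to return only once the entry is durable (for example, after a write followed by an fsync).

```python
from typing import Callable


def handle_input(
    entry: bytes,
    vote: str,
    wal_append: Callable[[bytes], None],
    broadcast: Callable[[str], None],
) -> bool:
    """Persist the input first; broadcast the resulting vote only on success."""
    try:
        wal_append(entry)  # assumed to be durable when this returns
    except OSError:
        return False  # the write failed, so the vote is never sent
    broadcast(vote)
    return True
```

In a real crash the process would stop rather than return `False`, but the effect is the same: the vote is never broadcast unless the input that caused it is already in the WAL.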
This issue was originally a ticket in Jira: https://circlepay.atlassian.net/browse/CCHAIN-771