Using dragonboat v4.
Is there a way to exit from a StateMachine Update with an error without causing a panic? Is this something you would consider adding? It could be triggered by returning a special kind of error from the Update, for example.
As I understand it, the assumption is that a state machine error always indicates a logic error, so the process should not continue. I would like to argue that this does not hold for OnDiskStateMachines. In our project we have multiple replicas per NodeHost, and each replica's SM stores data on its own HDD. The HDDs can enter RO mode or fail unexpectedly. That is a partial hardware fault, not a logic error. We would like to handle such cases gracefully by stopping the failed replica without affecting the other replicas on the same NodeHost.
We have to handle two disk failure scenarios:
- RO->RW: the FS enters RO mode, an operator fixes it and makes it RW again. In this case, the SM can continue applying updates after a long pause.
- Disk fails completely: we wait for a new replica to be created and to become up to date. After that, we can safely discard the failed replica.
The simplest way to handle these would be to just stop the replica without applying the last update.
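To make the request concrete, here is a minimal sketch of the behaviour I have in mind, written in plain Go. It does not use dragonboat's actual API: `ErrReplicaFault`, `Entry`, `StateMachine`, `diskSM`, and `replica` are all hypothetical names invented for illustration, and the real apply path is certainly more involved. The idea is only that a special error returned from Update stops the one replica and leaves the failed update unapplied, while any other error keeps today's panic behaviour.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrReplicaFault is a hypothetical sentinel error, not part of dragonboat.
// Returning it from Update means "storage fault, stop this replica".
var ErrReplicaFault = errors.New("replica storage fault")

// Entry is a hypothetical, simplified log entry.
type Entry struct {
	Index uint64
	Cmd   []byte
}

// StateMachine is a hypothetical, simplified stand-in for an on-disk SM.
type StateMachine interface {
	Update(entries []Entry) error
}

// diskSM simulates an SM whose disk can switch to read-only mode.
type diskSM struct {
	readOnly bool
	applied  uint64 // index of the last entry that was applied
}

func (s *diskSM) Update(entries []Entry) error {
	for _, e := range entries {
		if s.readOnly {
			// Wrap the sentinel so the caller can detect it with errors.Is.
			return fmt.Errorf("apply index %d: %w", e.Index, ErrReplicaFault)
		}
		s.applied = e.Index
	}
	return nil
}

// replica is a hypothetical wrapper that owns the apply step for one SM.
type replica struct {
	sm      StateMachine
	stopped bool
}

// apply reports whether the batch was applied. A storage fault stops only
// this replica; any other error is still treated as a logic error.
func (r *replica) apply(entries []Entry) bool {
	if r.stopped {
		return false
	}
	err := r.sm.Update(entries)
	switch {
	case err == nil:
		return true
	case errors.Is(err, ErrReplicaFault):
		r.stopped = true // stop without applying the failed update
		return false
	default:
		panic(err)
	}
}

// resume models the RO->RW scenario: the disk is writable again.
func (r *replica) resume() { r.stopped = false }

func main() {
	sm := &diskSM{}
	r := &replica{sm: sm}

	fmt.Println("applied:", r.apply([]Entry{{Index: 1}}))

	sm.readOnly = true // disk enters RO mode
	fmt.Println("applied:", r.apply([]Entry{{Index: 2}}), "stopped:", r.stopped)

	sm.readOnly = false // an operator makes the FS RW again
	r.resume()
	fmt.Println("applied:", r.apply([]Entry{{Index: 2}}), "last index:", sm.applied)
}
```

In this sketch the failed entry is simply retried after `resume`, which covers the RO->RW case. For a disk that fails completely, the replica would stay stopped until it is replaced and discarded.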