Background
The current batch membership change implementation has several limitations and semantic issues, identified in PR #1351.
Key Problems to Address
1. Retain Parameter Conflicts in Concurrent Changes
When multiple membership changes happen concurrently, their retain parameters can "pollute" each other, leading to unpredictable behavior where nodes may be unexpectedly retained or removed based on the order of operations.
Example:
Starting cluster membership: [{n1,n2,n3}]
Task 1: RemoveVoter(n1), retain = true
Task 2: RemoveVoter(n2), retain = false
Interleaving:
1. First Task1: [{n1,n2,n3}] → [{n1,n2,n3}, {n2,n3}]
2. Then Task2: [{n1,n2,n3}, {n2,n3}] → [{n2,n3}, {n3}]
- At this point, n1 is removed as a learner because Task2 has retain = false
3. Then either Task1 or Task2: [{n2,n3}, {n3}] → [{n3}]
- If Task1 runs first: n2 is kept as a learner
- If Task2 runs first: n2 is removed as a learner
Result: The retain behavior depends on execution order, not the original intent.
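The interleaving above can be reproduced with a deliberately simplified model. This is only a sketch: `ToyMembership`, `step`, `set`, and `run` are hypothetical names, and the rule "when retain is false, drop every node that is not a voter in any config" is an assumption inferred from the example, not the actual implementation.

```rust
use std::collections::BTreeSet;

type NodeId = u64;

/// Toy model (hypothetical): joint voter configs plus a flat set of known nodes.
#[derive(Debug, Clone)]
struct ToyMembership {
    configs: Vec<BTreeSet<NodeId>>,
    nodes: BTreeSet<NodeId>,
}

impl ToyMembership {
    /// Advance one step toward `target` voters.
    /// Assumption: with `retain == false`, nodes that are not voters
    /// in any remaining config are dropped instead of kept as learners.
    fn step(&mut self, target: BTreeSet<NodeId>, retain: bool) {
        let last = self.configs.last().cloned().unwrap_or_default();
        self.configs = if last == target {
            vec![target] // already at the target: flatten to a uniform config
        } else {
            vec![last, target] // otherwise enter a joint config
        };
        if !retain {
            let voters: BTreeSet<NodeId> = self.configs.iter().flatten().copied().collect();
            self.nodes.retain(|id| voters.contains(id));
        }
    }
}

fn set(ids: &[NodeId]) -> BTreeSet<NodeId> {
    ids.iter().copied().collect()
}

/// Runs the interleaving; `final_retain` is the retain flag of whichever
/// task happens to perform the last (flattening) step.
fn run(final_retain: bool) -> BTreeSet<NodeId> {
    let mut m = ToyMembership {
        configs: vec![set(&[1, 2, 3])],
        nodes: set(&[1, 2, 3]),
    };
    m.step(set(&[2, 3]), true); // Task1: RemoveVoter(n1), retain = true
    m.step(set(&[3]), false); // Task2: RemoveVoter(n2), retain = false; n1 is dropped here
    m.step(set(&[3]), final_retain); // Task1 or Task2 flattens to [{n3}]
    m.nodes
}

fn main() {
    println!("Task1 flattens: nodes = {:?}", run(true));
    println!("Task2 flattens: nodes = {:?}", run(false));
}
```

In this model the final node set is {2, 3} when Task1 performs the last step and {3} when Task2 does, even though both tasks were submitted with fixed retain values.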
2. Mixing Voter and Node Changes
The current implementation does not properly handle removing a voter and its corresponding node in the same batch. This can violate the constraint checked by ensure_voter_nodes.
Example:
Batch operation (with retain = true): RemoveVoters(n1), RemoveNode(n1)
Starting config: [{n3,n1}, {n2,n1}]
What happens:
1. RemoveVoters(n1) requires a joint config transition
2. RemoveNode(n1) tries to execute immediately
3. ERROR: violates ensure_voter_nodes constraint because n1 is still a voter in the config
The issue: Nodes can only be removed after the corresponding voter has been removed from the config.
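The constraint can be illustrated with a small stand-in check. This is a sketch under assumptions: `check_voter_nodes` is a hypothetical free function, not the real ensure_voter_nodes, and nodes are reduced to address strings.

```rust
use std::collections::{BTreeMap, BTreeSet};

type NodeId = u64;

/// Hypothetical stand-in for the constraint: every voter in every config
/// must have an entry in the nodes map. Returns the first offending voter.
fn check_voter_nodes(
    configs: &[BTreeSet<NodeId>],
    nodes: &BTreeMap<NodeId, String>,
) -> Result<(), NodeId> {
    for config in configs {
        for voter in config {
            if !nodes.contains_key(voter) {
                return Err(*voter);
            }
        }
    }
    Ok(())
}

fn main() {
    // Starting config from the example: [{n3,n1}, {n2,n1}]
    let configs = vec![BTreeSet::from([3, 1]), BTreeSet::from([2, 1])];
    let mut nodes: BTreeMap<NodeId, String> =
        (1..=3).map(|id| (id, format!("addr-{id}"))).collect();

    assert!(check_voter_nodes(&configs, &nodes).is_ok());

    // RemoveNode(n1) applied immediately, while n1 is still a voter:
    nodes.remove(&1);
    assert_eq!(check_voter_nodes(&configs, &nodes), Err(1));
}
```

The check only passes again once n1 has left every voter config, which is why the node removal has to wait for the joint transition to finish.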
3. Confusing Semantics
Learner changes happen immediately while voter changes go through joint configurations, making batch operations with mixed change types confusing and error-prone.
Example:
Batch operation: AddVoter(n4), AddNode(n5)
Starting config: [{n1,n2,n3}]
What happens:
1. AddVoter(n4): Creates joint config [{n1,n2,n3}, {n1,n2,n3,n4}] (staged change)
2. AddNode(n5): Immediately adds n5 to nodes map (immediate change)
Result:
- n4 is not available as a node until the joint config is flattened
- n5 is immediately available as a learner
- Mixed timing makes it hard to reason about the state
Proposed Solution
Implement explicit learner tracking in the membership structure:
Current structure:
struct Membership {
    configs: Vec<BTreeSet<VoterId>>,
    nodes: BTreeMap<NodeId, Node>,
}
Proposed structure:
/// Defines the role capabilities of a node in the Raft cluster.
///
/// In standard Raft:
/// - A voter can start elections and vote in elections, thus it has `Elect` and `Vote`
/// - A learner only receives logs from the leader and cannot vote or start elections, thus it has `AcceptLog`
///
/// This enum breaks down these capabilities into granular roles for more flexible node configurations.
///
/// | Functions | Vote Storage | Log Storage | Description |
/// | :--- | :--- | :--- | :--- |
/// | `Elect` | No | No | Can become a leader, without local storage |
/// | `Vote` | Yes | No | Participate in election voting, counts toward **read-quorum** |
/// | `LearnLog` | No | Yes | Receive log replication; a `Learner` in std Raft. Saving `Vote` is not required, but reduces conflicts when the node later becomes an `Elect` |
/// | `AcceptLog` | Yes | Yes | Receive log replication, counts toward **write-quorum** for commits. `AcceptLog` implies `LearnLog` |
/// | `Elect\|Vote\|AcceptLog` | Yes | Yes | A `Voter` in std Raft |
///
/// Note:
/// - `AcceptLog` always requires `Vote` storage to prevent accepting logs from an older leader.
/// - As of 2025-07-23, read/write quorum separation is not yet supported.
enum NodeFunction {
    /// Node can initiate elections and become a leader.
    ///
    /// All nodes, including learners, can potentially be elected as leader.
    ///
    /// In special cases, a node with only `Elect` can elect itself as leader without writing logs to its local storage.
    Elect,
    /// Participate in election voting, counts toward **read-quorum**, without local log storage
    Vote,
    /// Receive log replication, counts toward **write-quorum** for commits
    AcceptLog,
}
/// Represents a cluster member with specific functions and node information.
///
/// A member combines the functional capabilities of a node (what it can do)
/// with the actual node data (how to reach it, metadata, etc.).
struct Member<C: RaftTypeConfig> {
    /// Set of functions this member can perform in the cluster.
    ///
    /// This determines whether the member can vote, accept logs, or initiate elections.
    /// Multiple functions can be combined (e.g., a standard Raft voter has `Elect`, `Vote` and `AcceptLog`).
    functions: BTreeSet<NodeFunction>,
    /// The actual node information for this member.
    ///
    /// Contains network address, metadata, and other node-specific data
    /// as defined by the application's RaftTypeConfig.
    node: C::Node,
}
/// Represents a single configuration step in the membership.
///
/// In standard Raft, this would be the set of voters at a given point in time.
struct Config<C: RaftTypeConfig> {
    /// Map of all members participating in this configuration step.
    ///
    /// The key is the NodeId, and the value contains both the member's
    /// functional capabilities and connection information.
    members: BTreeMap<NodeId, Member<C>>,
}
/// Defines the complete membership configuration for a Raft cluster.
///
/// The membership can contain one or more configuration steps:
/// - Single config: Standard membership with one set of members
/// - Joint config: Transition state with multiple configs during membership changes
///
/// During a joint configuration, operations require quorum from ALL configs,
/// making membership changes safer but potentially slower.
struct EnhancedMembership<C: RaftTypeConfig> {
    /// One or more config entries.
    ///
    /// With more than one `Config`, this is a **joint config**:
    /// a quorum of a joint config is the union of a quorum from each `Config`.
    configs: Vec<Config<C>>,
}
Benefits
- Better State Representation: Explicitly track learners for each config step
- Improved Batch Operations: Enable proper handling of mixed voter/learner changes
- Clearer Semantics: Remove ambiguity around immediate vs. staged changes
- Enhanced Flexibility: Support operations that weren't possible before
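To make the joint-quorum rule of the proposed structure concrete, here is a minimal sketch of how a quorum check could look when each config tracks member functions explicitly. The names `has_quorum` and `joint_quorum` are hypothetical, node information is omitted, and a simple majority per config is assumed.

```rust
use std::collections::{BTreeMap, BTreeSet};

type NodeId = u64;

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum NodeFunction {
    Elect,
    Vote,
    AcceptLog,
}

/// Simplified member: only the function set, node info omitted.
struct Member {
    functions: BTreeSet<NodeFunction>,
}

struct Config {
    members: BTreeMap<NodeId, Member>,
}

impl Config {
    /// True if `granted` covers a majority of the members that have `function`.
    /// Assumption: a config with no eligible member never forms a quorum.
    fn has_quorum(&self, function: NodeFunction, granted: &BTreeSet<NodeId>) -> bool {
        let mut eligible = 0usize;
        let mut acks = 0usize;
        for (id, member) in &self.members {
            if member.functions.contains(&function) {
                eligible += 1;
                if granted.contains(id) {
                    acks += 1;
                }
            }
        }
        eligible > 0 && acks > eligible / 2
    }
}

/// A joint config needs a quorum in every config.
fn joint_quorum(configs: &[Config], function: NodeFunction, granted: &BTreeSet<NodeId>) -> bool {
    configs.iter().all(|c| c.has_quorum(function, granted))
}

/// Builds a config in which every listed node is a standard Raft voter.
fn voters(ids: &[NodeId]) -> Config {
    let all = BTreeSet::from([NodeFunction::Elect, NodeFunction::Vote, NodeFunction::AcceptLog]);
    Config {
        members: ids.iter().map(|id| (*id, Member { functions: all.clone() })).collect(),
    }
}

fn main() {
    // Joint config [{n1,n2,n3}, {n1,n2,n3,n4}]
    let joint = vec![voters(&[1, 2, 3]), voters(&[1, 2, 3, 4])];

    // {n1,n2} is a majority of the first config but not of the second.
    assert!(!joint_quorum(&joint, NodeFunction::AcceptLog, &BTreeSet::from([1, 2])));
    // {n1,n2,n3} is a majority of both.
    assert!(joint_quorum(&joint, NodeFunction::AcceptLog, &BTreeSet::from([1, 2, 3])));
}
```

Because the function is a parameter, the same check could serve vote counting (`Vote`) and commit counting (`AcceptLog`) if read/write quorum separation is added later.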
Implementation Considerations
- Must be implemented as a backward-compatible change to avoid breaking existing applications
- Consider renaming AddNodes/RemoveNodes to AddLearners/RemoveLearners for clarity
- Review semantics of SetNodes and ReplaceAllNodes operations
TODO:
- Design RaftMembership trait and update RaftTypeConfig
  - Define trait with methods: quorum calculation, member lookup, vote/learner separation
  - Add type Membership: RaftMembership to RaftTypeConfig
  - Implement trait for existing Membership struct
  - Update RaftCore to use C::Membership
- Create EnhancedMembership with new semantics
  - Implement granular node functions (Elect, Vote, AcceptLog)
  - Fix retain parameter conflicts and batch operation issues
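As a starting point for the first TODO item, the trait could look roughly like the sketch below. Everything here is an assumption for discussion: the method names (`is_quorum`, `get_node`, `voter_ids`, `learner_ids`), the associated `Node` type, and the simplified `Membership` stand-in are hypothetical and do not reflect the final design.

```rust
use std::collections::{BTreeMap, BTreeSet};

type NodeId = u64;

/// Hypothetical trait shape covering the three method groups in the TODO list.
trait RaftMembership {
    type Node;

    /// Quorum calculation: does `granted` form a quorum in every config?
    fn is_quorum(&self, granted: &BTreeSet<NodeId>) -> bool;
    /// Member lookup.
    fn get_node(&self, id: &NodeId) -> Option<&Self::Node>;
    /// Voter/learner separation.
    fn voter_ids(&self) -> BTreeSet<NodeId>;
    fn learner_ids(&self) -> BTreeSet<NodeId>;
}

/// Simplified stand-in for the existing structure (nodes reduced to addresses).
struct Membership {
    configs: Vec<BTreeSet<NodeId>>,
    nodes: BTreeMap<NodeId, String>,
}

impl RaftMembership for Membership {
    type Node = String;

    fn is_quorum(&self, granted: &BTreeSet<NodeId>) -> bool {
        // Majority in each config; a joint config needs all of them.
        self.configs
            .iter()
            .all(|c| c.intersection(granted).count() > c.len() / 2)
    }

    fn get_node(&self, id: &NodeId) -> Option<&String> {
        self.nodes.get(id)
    }

    fn voter_ids(&self) -> BTreeSet<NodeId> {
        self.configs.iter().flatten().copied().collect()
    }

    fn learner_ids(&self) -> BTreeSet<NodeId> {
        // In the existing structure a learner is any node that is not a voter.
        let voters = self.voter_ids();
        self.nodes.keys().copied().filter(|id| !voters.contains(id)).collect()
    }
}

fn main() {
    let m = Membership {
        configs: vec![BTreeSet::from([1, 2, 3])],
        nodes: (1..=4).map(|id| (id, format!("addr-{id}"))).collect(),
    };
    assert!(m.is_quorum(&BTreeSet::from([1, 2])));
    assert!(!m.is_quorum(&BTreeSet::from([1])));
    assert_eq!(m.learner_ids(), BTreeSet::from([4]));
    assert_eq!(m.get_node(&4).map(String::as_str), Some("addr-4"));
}
```

Implementing the trait for the existing struct first, as the TODO suggests, would let RaftCore switch to C::Membership before EnhancedMembership exists.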