-
Notifications
You must be signed in to change notification settings - Fork 62
[otap-df-quiver] Add Write-Ahead-Log (WAL) implementation to Quiver #1537
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| } | ||
|
|
||
| let entry_len = u32::from_le_bytes(len_buf) as usize; | ||
| self.buffer.resize(entry_len, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, the reader is trusting the length got from the file, and doing the allocation. In case of the WAL file is corrupted/or malicious attack, the length is large enough ( say 0xFFFF..), the reader will try allocate that size. While the 4 bytes will limit the allocation to 4GB, all df_instance doing this allocation can result in OOM crash. Should we have some kind of max limit check (say WAL size won't be more than 64MB) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, good point. The default rotation target size for a single file is 64MB. I added an upper bound of 256MB for the rotation target file size which is also used by the reader as an upper bound for an entry size.
Additionally, when we are reading the file, I added validation to ensure entry_len doesn't exceed the remaining file length to guard against this corruption/attack scenario.
| path, | ||
| segment_cfg_hash, | ||
| flush_policy, | ||
| max_wal_size: u64::MAX, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More of a question - what will happen if we have reached the limit of 8 wal rotated files of 64MB each, while this max_wal_size limit is still not reached?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are two different size enforcement mechanisms:
rotation_target_bytesxmax_rotated_filesmax_wal_size
If we reach the limit for (1), then we return a WalAtCapacity error and (2) is not enforced. (1) is for controlling rotation behavior and limiting the amount of file descriptors used. (2) is for global size enforcement. (We could potentially simplify and eliminate (2), but will leave in for now.)
See tests: wal_writer_errors_when_rotated_file_cap_reached and wal_writer_preflight_rejects_when_rotated_file_cap_hit.
| } | ||
| } | ||
| Ok(highest.map_or(0, |seq| seq.wrapping_add(1))) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
detect_next_sequence() scans all entries in all WAL files on startup. For the default ( 64MB * 8 = 512MB ) capacity this is fine, but would it make sense to persist the last sequence in the checkpoint sidecar for faster recovery?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better yet, detect_next_sequence() doesn't need to scan all the rotated WAL files. It should only be looking at the active file (or the single most recent rotated file if the active file doesn't exist). Measuring this locally and time is on the order of ~30ms with a 64MB file, so doesn't seem worth putting in the sidecar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Measuring this locally and time is on the order of ~30ms with a 64MB file, so doesn't seem worth putting in the sidecar.
Agreed on 30ms for typical cases, though edge devices with slower storage could see higher times. For now, should be good to scan only the active file starting from the checkpoint offset (as we read sidecar before wal) rather than from the beginning. We can revisit persisting the sequence number if startup time becomes an issue in production.
…tive_file()/trim_partial_entries()
…izes against remaining file bytes before allocation
…ious/corrupted length values when reading
albertlockett
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM @AaronRM !
This pull request introduces an implementation of the Quiver write-ahead log (WAL). The most significant change from the initial spec includes a rewrite and clarification of the WAL file rotation and checkpointing mechanism. Documentation has been updated to reflect the new design.