Feature Request: Improved builtin backup engine #20159

### Feature Description

I am creating this as a placeholder for discussion and to track the different PRs that will be proposed to improve the speed of backups and restores using the `builtinbackupengine`. Some of those will be specific to S3, while others will apply to every storage backend.

**S3 restores are slow:**
- The current restore pipeline uses `io.Copy()` with layered readers/writers (and compressors), so all stages (network download, decompression, disk write) execute in lockstep within a single goroutine. If any stage stalls, the entire pipeline blocks, which reduces overall throughput.
- Files are processed essentially single-threaded against S3, uploading/downloading one file (table) at a time per connection. S3 has its own limitations, at about 80 MB/s per connection, which severely limits how fast we can go. This is probably okay in shards whose tables are similarly sized (so we can download them in parallel), but if you have a shard where one table is much larger than the others, the restore eventually becomes very slow. The decompressor also runs at a variable speed, and when it is slower than the network, the TCP receive buffer fills, the kernel reduces the receive window, and throughput drops.
- Currently we use regular `pwrite` syscalls to copy data from user space into the kernel page cache. After some optimisations we did, we noticed that writing 1GB with 32 threads in parallel fills the page cache rapidly and `vttablet` spends a lot of CPU on those syscalls. The kernel's writeback becomes a bottleneck: it must flush dirty pages, evict them, and then the process must re-add new pages.

**High-level proposals:**
- Replace `io.Copy()` with separate goroutines, one per stage of the restore pipeline, connected via buffered channels. This allows the network reader to continue fetching data while the decompressor works, and the disk writer to flush independently.
- Change the `builtinbackupengine` to support a new option that splits the backup into smaller chunks. This requires a change to the `MANIFEST`, which now needs to track the size and offset of each part of every file. With this in place, we can download the chunks in parallel, massively increasing throughput compared to the previous implementation.
- Use `O_DIRECT` (similar to how MySQL does it) to bypass the page cache and write directly to disk. This avoids page cache thrashing and avoids wasting time on syscalls.

In our internal testing, we saw a massive improvement: restoring a 2TB tablet was ~4x faster (it went from 49m to 12m). It became way faster than `xtrabackup`, which shows similar behaviour on shards where the dataset is dominated by a single table.

The plan is to create a separate PR for each of these improvements, explaining the changes in more depth so they can be reviewed more easily. The PRs should be mostly independent from each other. From what we saw, we should be able to keep backups compatible between versions as well: a newer version of Vitess will always be able to restore a backup from a previous version, and chunking can be controlled by a parameter that can be disabled if there is a desire to create a backup that can be loaded into a tablet running a pre-optimisation version.

### Use Case(s)

Make restores faster than they are today
