chunked builtin backup engine #20167

Open
rvrangel wants to merge 5 commits into vitessio:main from rvrangel:builtin-backup-chunking

@rvrangel
Contributor

@rvrangel rvrangel commented May 22, 2026

Description

This is the first PR as part of #20159

This PR adds chunked parallel backup/restore to the builtin backup engine. Files larger than a configurable threshold are split into independently-compressed chunks during backup, which can then be restored in parallel using writes at known offsets.

Changes:

  • Two new flags: --builtinbackup-file-chunk-threshold (default 0, disabled) and --builtinbackup-file-chunk-size (default 1GiB)
  • During backup, files exceeding the threshold are split into chunks, each stored as a separate object in backup storage
  • During restore, chunks of the same file are written concurrently via offsetWriter (pwrite semantics)
  • Failed chunks are retried using the same mechanism as whole-file retries
  • Backward compatible: threshold=0 disables chunking, and old manifests (no Chunks field) restore identically to before

Related Issue(s)

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

Deployment Notes

AI Disclosure

I created this PR with support from Claude. I fully tested it on our own branch before publishing, and tested it with unit tests and e2e tests on the main branch.

Signed-off-by: Renan Rangel <rrangel@slack-corp.com>
Copilot AI review requested due to automatic review settings May 22, 2026 18:28
@github-actions github-actions Bot added this to the v25.0.0 milestone May 22, 2026
@vitess-bot vitess-bot Bot added the NeedsWebsiteDocsUpdate, NeedsDescriptionUpdate, NeedsIssue, and NeedsBackportReason labels May 22, 2026
@vitess-bot
Contributor

vitess-bot Bot commented May 22, 2026

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes). New features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test. Enhancements and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

Contributor

Copilot AI left a comment

Pull request overview

This PR adds optional chunking to the builtinbackupengine so that large MySQL files can be backed up and restored as independently-compressed pieces, enabling higher parallelism (especially beneficial for object stores like S3) and improving restore throughput.

Changes:

  • Introduces chunk metadata in the backup manifest (FileEntry.Chunks) and new flags to control chunking (--builtinbackup-file-chunk-threshold, --builtinbackup-file-chunk-size).
  • Updates builtin backup/restore to split large files into chunks for parallel backup and to restore chunked files via parallel WriteAt (pwrite-style) writes into a pre-sized destination.
  • Adds unit and end-to-end tests validating chunk name parsing and verifying chunked vs non-chunked backups via MANIFEST inspection.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| go/vt/mysqlctl/file_close_test.go | Updates tests for the new backupFile(..., chunkIndex) signature. |
| go/vt/mysqlctl/builtinbackupengine.go | Core implementation: chunking flags, manifest schema, chunked backup work scheduling, and parallel chunk restore. |
| go/vt/mysqlctl/builtinbackupengine_test.go | Adds unit tests for parsing storage names (parseBackupName). |
| go/test/endtoend/backup/vtctlbackup/backup_utils.go | Adds helpers to verify chunking by reading MANIFEST and counting chunks. |
| go/test/endtoend/backup/vtctlbackup/backup_test.go | Adds end-to-end tests for chunked and non-chunked builtin backups with forced small thresholds/sizes. |
| go/flags/endtoend/vttestserver.txt | Documents new builtinbackup chunking flags in end-to-end flag snapshots. |
| go/flags/endtoend/vttablet.txt | Documents new builtinbackup chunking flags in end-to-end flag snapshots. |
| go/flags/endtoend/vtctld.txt | Documents new builtinbackup chunking flags in end-to-end flag snapshots. |
| go/flags/endtoend/vtcombo.txt | Documents new builtinbackup chunking flags in end-to-end flag snapshots. |
| go/flags/endtoend/vtbackup.txt | Documents new builtinbackup chunking flags in end-to-end flag snapshots. |
Comments suppressed due to low confidence (2)

go/vt/mysqlctl/builtinbackupengine.go:1372

  • Chunk restore goroutines close over loop variables j and fe (and use dest/fe.Name inside the closure). This can result in writing the wrong chunk offset/data and misreporting errors/logs. Rebind the loop variables (e.g. j := j, feLocal := fe) before starting each goroutine.
			for j := range fe.Chunks {
				g.Go(func() error {
					chunk := &fe.Chunks[j]

					select {

go/vt/mysqlctl/builtinbackupengine.go:1394

  • Non-chunked restore goroutine closes over i/fe from the enclosing for-loop. This can cause it to restore the wrong file index and log/record errors under the wrong name. Capture locals (e.g. iLocal := i, feLocal := fe) before g.Go.
			// Non-chunked file: restore as before.
			g.Go(func() error {
				name := strconv.Itoa(i)

				select {

Comment thread go/vt/mysqlctl/builtinbackupengine.go Outdated
Comment on lines +663 to +667
if backupFileChunkThreshold > 0 && fileSize > backupFileChunkThreshold {
	numChunks := (fileSize + backupFileChunkSize - 1) / backupFileChunkSize
	fe.Chunks = make([]FileChunk, numChunks)
	for j := range numChunks {
		offset := j * backupFileChunkSize
Contributor Author

it does compile, this sounds like advice for an older Go version?

Comment on lines +776 to +790
+ for _, wi := range workItems {
  	g.Go(func() error {
- 		fe := &fes[i]
- 		name := strconv.Itoa(i)
+ 		fe := &fes[wi.feIndex]

- 		// Check for context cancellation explicitly because, the way semaphore code is written, theoretically we might
- 		// end up not throwing an error even after cancellation. Please see https://cs.opensource.google/go/x/sync/+/refs/tags/v0.1.0:semaphore/semaphore.go;l=66,
- 		// which suggests that if the context is already done, `Acquire()` may still succeed without blocking. This introduces
- 		// unpredictability in my test cases, so in order to avoid that, I am adding this cancellation check.
  		select {
+ 		// Skip work if the context has been cancelled (e.g. another goroutine failed).
  		case <-ctxCancel.Done():
  			log.Error(fmt.Sprintf("Context canceled or timed out during %q backup", fe.Name))
- 			bh.RecordError(name, vterrors.Errorf(vtrpcpb.Code_CANCELED, "context canceled"))
+ 			bh.RecordError(wi.name, vterrors.Errorf(vtrpcpb.Code_CANCELED, "context canceled"))
  			return nil
  		default:
  		}

  		// Backup the individual file.
- 		var errBackupFile error
- 		if errBackupFile = be.backupFile(ctxCancel, params, bh, fe, name); errBackupFile != nil {
- 			bh.RecordError(name, vterrors.Wrapf(errBackupFile, "failed to backup file '%s'", name))
+ 		if errBackupFile := be.backupFile(ctxCancel, params, bh, fe, wi.name, wi.chunkIndex); errBackupFile != nil {
+ 			bh.RecordError(wi.name, vterrors.Wrapf(errBackupFile, "failed to backup '%s'", wi.name))
Contributor Author

same here, this seems pre 1.22: https://go.dev/doc/go1.22#language

Comment on lines 1291 to 1310
  if files := bh.GetFailedFiles(); len(files) > 0 {
  	newFEs := make([]FileEntry, len(fes))
  	for _, file := range files {
- 		fileNb, err := strconv.Atoi(file)
- 		if err != nil {
- 			return "", vterrors.Wrapf(err, "failed to retry file '%s'", file)
+ 		feIdx, chunkIdx, parseErr := parseBackupName(file)
+ 		if parseErr != nil {
+ 			return "", parseErr
  		}
- 		oldFes := fes[fileNb]
- 		newFEs[fileNb] = FileEntry{
- 			Base:       oldFes.Base,
- 			Name:       oldFes.Name,
- 			ParentPath: oldFes.ParentPath,
- 			Hash:       oldFes.Hash,
- 			RetryCount: 1,
+ 		oldFe := fes[feIdx]
+ 		if newFEs[feIdx].Name == "" {
+ 			newFEs[feIdx] = FileEntry{
+ 				Base:       oldFe.Base,
+ 				Name:       oldFe.Name,
+ 				ParentPath: oldFe.ParentPath,
+ 				Hash:       oldFe.Hash,
+ 				RetryCount: 1,
+ 			}
  		}
+ 		if chunkIdx >= 0 {
+ 			newFEs[feIdx].Chunks = append(newFEs[feIdx].Chunks, oldFe.Chunks[chunkIdx])
+ 		}
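The retry path above depends on mapping a failed storage name back to a file index and an optional chunk index. A rough sketch of such a parser follows, assuming whole files are named "N", chunks are named "N-M" (the format used by the `%d-%d` storage names elsewhere in this PR), and a chunk index of -1 means "not a chunk". The name `parseName` is hypothetical and this is not the PR's `parseBackupName` implementation.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseName accepts "3" (whole file 3) or "3-7" (chunk 7 of file 3).
// chunkIdx is -1 for whole files; that convention is an assumption here.
func parseName(name string) (feIdx, chunkIdx int, err error) {
	filePart, chunkPart, isChunk := strings.Cut(name, "-")
	feIdx, err = strconv.Atoi(filePart)
	if err != nil {
		return 0, -1, fmt.Errorf("invalid backup name %q", name)
	}
	if !isChunk {
		return feIdx, -1, nil
	}
	chunkIdx, err = strconv.Atoi(chunkPart)
	if err != nil || chunkIdx < 0 {
		return 0, -1, fmt.Errorf("invalid chunk index in backup name %q", name)
	}
	return feIdx, chunkIdx, nil
}

func main() {
	for _, name := range []string{"12", "3-7", "3-x"} {
		fe, chunk, err := parseName(name)
		fmt.Println(name, fe, chunk, err)
	}
}
```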
Signed-off-by: Renan Rangel <rrangel@slack-corp.com>
@rvrangel rvrangel marked this pull request as ready for review May 25, 2026 13:06
@rvrangel rvrangel requested a review from mattlord as a code owner May 25, 2026 13:06
Copilot AI review requested due to automatic review settings May 25, 2026 13:06
@rvrangel rvrangel requested a review from frouioui as a code owner May 25, 2026 13:06
@promptless
Contributor

promptless Bot commented May 25, 2026

Promptless prepared a documentation update related to this change.

Triggered by PR #20167

Added documentation for the new --builtinbackup-file-chunk-threshold and --builtinbackup-file-chunk-size flags to the backup and restore overview guide. These flags enable parallel backup and restore of large files by splitting them into independently-compressed chunks.

Review: Document builtin backup chunking flags

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

Comment on lines +1390 to +1397
fullPath, pathErr := fe.fullPath(params.Cnf)
if pathErr != nil {
	return vterrors.Wrapf(pathErr, "can't get path for chunked file %v", fe.Name)
}
dest, openErr := os.OpenFile(fullPath, os.O_WRONLY, 0o644)
if openErr != nil {
	return vterrors.Wrapf(openErr, "can't open destination for chunked file %v", fe.Name)
}
@codecov

codecov Bot commented May 25, 2026

Codecov Report

❌ Patch coverage is 77.95276% with 56 lines in your changes missing coverage. Please review.
✅ Project coverage is 52.78%. Comparing base (70c7a72) to head (22620c5).
⚠️ Report is 277 commits behind head on main.

Files with missing lines Patch % Lines
go/vt/mysqlctl/builtinbackupengine.go 77.95% 56 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (70c7a72) and HEAD (22620c5).

HEAD has 1 upload less than BASE (BASE: 1, HEAD: 0).
Additional details and impacted files
@@             Coverage Diff             @@
##             main   #20167       +/-   ##
===========================================
- Coverage   69.67%   52.78%   -16.89%     
===========================================
  Files        1614       46     -1568     
  Lines      216793     7290   -209503     
===========================================
- Hits       151044     3848   -147196     
+ Misses      65749     3442    -62307     
Flag Coverage Δ
partial 52.78% <77.95%> (?)

Signed-off-by: Renan Rangel <rrangel@slack-corp.com>
@frouioui frouioui added the Type: Enhancement and Component: Backup and Restore labels and removed the NeedsDescriptionUpdate, NeedsIssue, and NeedsBackportReason labels May 25, 2026
Comment on lines +179 to +184
type FileChunk struct {
	StorageName string
	Offset      int64
	Size        int64
	Hash        string
}
Member

Let's add comments for each field, similar to the FileEntry struct

Contributor Author

comments added :)

Signed-off-by: Renan Rangel <rrangel@slack-corp.com>
Copilot AI review requested due to automatic review settings May 26, 2026 15:39
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.

Comment on lines +196 to +212
func computeFileChunks(fileIndex int, fileSize, chunkSize int64) []FileChunk {
	numChunks := (fileSize + chunkSize - 1) / chunkSize
	chunks := make([]FileChunk, numChunks)
	for j := range numChunks {
		offset := j * chunkSize
		size := chunkSize
		if offset+size > fileSize {
			size = fileSize - offset
		}
		chunks[j] = FileChunk{
			StorageName: fmt.Sprintf("%d-%d", fileIndex, j),
			Offset:      offset,
			Size:        size,
		}
	}
	return chunks
}
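To make the ceiling division and the short final chunk concrete, here is a worked example that mirrors the arithmetic of the quoted function on a 2.5 GiB file with 1 GiB chunks. The `chunk` type and `layout` function are simplified, hypothetical stand-ins written for this illustration, and the file index 4 is arbitrary.

```go
package main

import "fmt"

// chunk is a simplified stand-in for the PR's FileChunk (no Hash field).
type chunk struct {
	Name   string
	Offset int64
	Size   int64
}

// layout splits fileSize bytes into chunkSize pieces; the last piece may be
// shorter. It assumes chunkSize > 0.
func layout(fileIndex int, fileSize, chunkSize int64) []chunk {
	n := (fileSize + chunkSize - 1) / chunkSize // ceiling division
	chunks := make([]chunk, 0, n)
	for j := int64(0); j < n; j++ {
		off := j * chunkSize
		size := min(chunkSize, fileSize-off)
		chunks = append(chunks, chunk{
			Name:   fmt.Sprintf("%d-%d", fileIndex, j),
			Offset: off,
			Size:   size,
		})
	}
	return chunks
}

func main() {
	const MiB = 1 << 20
	// 2560 MiB file, 1024 MiB chunks: (2560 + 1023) / 1024 = 3 chunks.
	for _, c := range layout(4, 2560*MiB, 1024*MiB) {
		fmt.Printf("%s offset=%dMiB size=%dMiB\n", c.Name, c.Offset/MiB, c.Size/MiB)
	}
}
```

The three chunks come out as 4-0 and 4-1 at 1024 MiB each, followed by 4-2 with the remaining 512 MiB starting at offset 2048 MiB.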
Comment on lines +790 to +804
+ for _, wi := range workItems {
  	g.Go(func() error {
- 		fe := &fes[i]
- 		name := strconv.Itoa(i)
+ 		fe := &fes[wi.feIndex]

- 		// Check for context cancellation explicitly because, the way semaphore code is written, theoretically we might
- 		// end up not throwing an error even after cancellation. Please see https://cs.opensource.google/go/x/sync/+/refs/tags/v0.1.0:semaphore/semaphore.go;l=66,
- 		// which suggests that if the context is already done, `Acquire()` may still succeed without blocking. This introduces
- 		// unpredictability in my test cases, so in order to avoid that, I am adding this cancellation check.
  		select {
+ 		// Skip work if the context has been cancelled (e.g. another goroutine failed).
  		case <-ctxCancel.Done():
  			log.Error(fmt.Sprintf("Context canceled or timed out during %q backup", fe.Name))
- 			bh.RecordError(name, vterrors.Errorf(vtrpcpb.Code_CANCELED, "context canceled"))
+ 			bh.RecordError(wi.name, vterrors.Errorf(vtrpcpb.Code_CANCELED, "context canceled"))
  			return nil
  		default:
  		}

  		// Backup the individual file.
- 		var errBackupFile error
- 		if errBackupFile = be.backupFile(ctxCancel, params, bh, fe, name); errBackupFile != nil {
- 			bh.RecordError(name, vterrors.Wrapf(errBackupFile, "failed to backup file '%s'", name))
+ 		if errBackupFile := be.backupFile(ctxCancel, params, bh, fe, wi.name, wi.chunkIndex); errBackupFile != nil {
+ 			bh.RecordError(wi.name, vterrors.Wrapf(errBackupFile, "failed to backup '%s'", wi.name))
Comment on lines +1414 to +1423
for j := range fe.Chunks {
	g.Go(func() error {
		chunk := &fe.Chunks[j]

		select {
		// Skip work if the context has been cancelled (e.g. another goroutine failed).
		case <-ctx.Done():
			log.Error(fmt.Sprintf("Context canceled or timed out during %q chunk %d restore", fe.Name, j))
			bh.RecordError(chunk.StorageName, vterrors.Errorf(vtrpcpb.Code_CANCELED, "context canceled"))
			return nil
Comment thread go/vt/mysqlctl/builtinbackupengine.go Outdated
Comment on lines +281 to +283
if backupFileChunkSize <= 0 {
	return BackupUnusable, vterrors.Errorf(vtrpcpb.Code_FAILED_PRECONDITION, "builtinbackup-file-chunk-size can't be zero")
}
Comment on lines +1499 to +1509
cleanup := func() error {
	params.Logger.Infof("closing decompressor")
	closeAt := time.Now()
	cerr := closeWithRetry(ctx, params.Logger, closer, "decompressor")
	if cerr != nil {
		cerr = vterrors.Wrapf(cerr, "failed to close decompressor %v", name)
		params.Logger.Error(cerr)
	}
	params.Stats.Scope(stats.Operation("Decompressor:Close")).TimedIncrement(time.Since(closeAt))
	return cerr
}
Member

@mattlord mattlord left a comment

go/vt/mysqlctl/builtinbackupengine.go:1386-1390 ignores final close errors for chunked restore destinations. Non-chunked restore propagates destination close failures because they can mean data was not safely flushed; chunked restore only logs them and can report success, then later attempt to start MySQL on an incomplete/corrupt file. Please collect these close errors and return them, and also check the dest.Close() in createChunkedDestinations at line 1367.

go/vt/mysqlctl/builtinbackupengine.go:196-198 / :691-694 has no bound on the chunk count. A small --builtinbackup-file-chunk-size typo, e.g. 1, can allocate one FileChunk and one work item per byte of a large InnoDB file before backup starts. Please enforce a sane minimum chunk size or a max chunks-per-file limit before allocating.

go/vt/mysqlctl/builtinbackupengine.go:281-283 validates chunk size even when chunking is disabled or when taking an incremental backup that may not use chunking. I’d validate chunk-size > 0 only when chunk-threshold > 0, reject negative thresholds explicitly, and fix the message to say must be > 0.
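A sketch of what the flag validation from the two points above might look like follows. The function name `validateChunkFlags` and the 64 MiB floor are assumptions made for illustration; the review does not name a concrete minimum, and the real code would return Vitess error types rather than plain errors.

```go
package main

import (
	"errors"
	"fmt"
)

// minChunkSize is a hypothetical lower bound that keeps the per-file chunk
// count small even for very large files. The value is illustrative only.
const minChunkSize = 64 << 20 // 64 MiB

// validateChunkFlags checks the chunk size only when chunking is enabled
// (threshold > 0) and rejects negative thresholds explicitly.
func validateChunkFlags(threshold, chunkSize int64) error {
	if threshold < 0 {
		return errors.New("builtinbackup-file-chunk-threshold must be >= 0")
	}
	if threshold == 0 {
		return nil // chunking disabled: the chunk size is never used
	}
	if chunkSize <= 0 {
		return errors.New("builtinbackup-file-chunk-size must be > 0")
	}
	if chunkSize < minChunkSize {
		return fmt.Errorf("builtinbackup-file-chunk-size must be at least %d bytes", minChunkSize)
	}
	return nil
}

func main() {
	fmt.Println(validateChunkFlags(0, 0))          // chunking disabled
	fmt.Println(validateChunkFlags(1<<30, 1))      // one chunk per byte is rejected
	fmt.Println(validateChunkFlags(1<<30, 1<<30))  // 1 GiB chunks are accepted
}
```

A minimum chunk size bounds the chunk count indirectly; a maximum chunks-per-file check before allocation would be the alternative the review mentions.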

I agree with the compatibility caveat too: once chunking is enabled, those backups are not restorable by older Vitess versions because old restore code ignores Chunks and looks for whole-file objects. That should be called out in release notes summary.

Signed-off-by: Renan Rangel <rrangel@slack-corp.com>
@rvrangel
Contributor Author

thanks @mattlord, I have updated the PR! let me know if you spot any other issues or want me to make any changes.

Labels

Component: Backup and Restore, NeedsWebsiteDocsUpdate, Type: Enhancement

4 participants