chunked builtin backup engine by rvrangel · Pull Request #20167 · vitessio/vitess

rvrangel · 2026-05-22T18:28:24Z

Description

This is the first PR as part of #20159

This PR adds chunked parallel backup/restore to the builtin backup engine. Files larger than a configurable threshold are split into independently-compressed chunks during backup, which can then be restored in parallel using writes at known offsets.

Changes:

Two new flags: --builtinbackup-file-chunk-threshold (default 0, disabled) and --builtinbackup-file-chunk-size (default 1GiB)
During backup, files exceeding the threshold are split into chunks, each stored as a separate object in backup storage
During restore, chunks of the same file are written concurrently via offsetWriter (pwrite semantics)
Failed chunks are retried using the same mechanism as whole-file retries
Backward compatible: threshold=0 disables chunking, and old manifests (no Chunks field) restore identically to before

Related Issue(s)

Feature Request: Improved builtin backup engine #20159

Checklist

"Backport to:" labels have been added if this change should be back-ported to release branches
If this change is to be back-ported to previous releases, a justification is included in the PR description
Tests were added or are not required
Did the new or modified tests pass consistently locally and on CI?
Documentation was added or is not required

Deployment Notes

AI Disclosure

PR created by me with support from Claude. I fully tested it on our own branch before publishing, and it is also tested with unit tests and e2e tests on the main branch.

Signed-off-by: Renan Rangel <rrangel@slack-corp.com>

vitess-bot · 2026-05-22T18:28:55Z

Copilot

Pull request overview

This PR adds optional chunking to the builtinbackupengine so that large MySQL files can be backed up and restored as independently-compressed pieces, enabling higher parallelism (especially beneficial for object stores like S3) and improving restore throughput.

Changes:

Introduces chunk metadata in the backup manifest (FileEntry.Chunks) and new flags to control chunking (--builtinbackup-file-chunk-threshold, --builtinbackup-file-chunk-size).
Updates builtin backup/restore to split large files into chunks for parallel backup and to restore chunked files via parallel WriteAt (pwrite-style) writes into a pre-sized destination.
Adds unit and end-to-end tests validating chunk name parsing and verifying chunked vs non-chunked backups via MANIFEST inspection.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
go/vt/mysqlctl/file_close_test.go	Updates tests for the new `backupFile(..., chunkIndex)` signature.
go/vt/mysqlctl/builtinbackupengine.go	Core implementation: chunking flags, manifest schema, chunked backup work scheduling, and parallel chunk restore.
go/vt/mysqlctl/builtinbackupengine_test.go	Adds unit tests for parsing storage names (`parseBackupName`).
go/test/endtoend/backup/vtctlbackup/backup_utils.go	Adds helpers to verify chunking by reading MANIFEST and counting chunks.
go/test/endtoend/backup/vtctlbackup/backup_test.go	Adds end-to-end tests for chunked and non-chunked builtin backups with forced small thresholds/sizes.
go/flags/endtoend/vttestserver.txt	Documents new builtinbackup chunking flags in end-to-end flag snapshots.
go/flags/endtoend/vttablet.txt	Documents new builtinbackup chunking flags in end-to-end flag snapshots.
go/flags/endtoend/vtctld.txt	Documents new builtinbackup chunking flags in end-to-end flag snapshots.
go/flags/endtoend/vtcombo.txt	Documents new builtinbackup chunking flags in end-to-end flag snapshots.
go/flags/endtoend/vtbackup.txt	Documents new builtinbackup chunking flags in end-to-end flag snapshots.

Comments suppressed due to low confidence (2)

go/vt/mysqlctl/builtinbackupengine.go:1372

Chunk restore goroutines close over loop variables j and fe (and use dest/fe.Name inside the closure). This can result in writing the wrong chunk offset/data and misreporting errors/logs. Rebind the loop variables (e.g. j := j, feLocal := fe) before starting each goroutine.

			for j := range fe.Chunks {
				g.Go(func() error {
					chunk := &fe.Chunks[j]

					select {

go/vt/mysqlctl/builtinbackupengine.go:1394

Non-chunked restore goroutine closes over i/fe from the enclosing for-loop. This can cause it to restore the wrong file index and log/record errors under the wrong name. Capture locals (e.g. iLocal := i, feLocal := fe) before g.Go.

			// Non-chunked file: restore as before.
			g.Go(func() error {
				name := strconv.Itoa(i)

				select {

rvrangel · 2026-05-25T10:08:19Z

+		if backupFileChunkThreshold > 0 && fileSize > backupFileChunkThreshold {
+			numChunks := (fileSize + backupFileChunkSize - 1) / backupFileChunkSize
+			fe.Chunks = make([]FileChunk, numChunks)
+			for j := range numChunks {
+				offset := j * backupFileChunkSize


it does compile, this sounds like advice for an older Go version?

rvrangel · 2026-05-25T10:11:54Z

+	for _, wi := range workItems {
 		g.Go(func() error {
-			fe := &fes[i]
-			name := strconv.Itoa(i)
+			fe := &fes[wi.feIndex]

-			// Check for context cancellation explicitly because, the way semaphore code is written, theoretically we might
-			// end up not throwing an error even after cancellation. Please see https://cs.opensource.google/go/x/sync/+/refs/tags/v0.1.0:semaphore/semaphore.go;l=66,
-			// which suggests that if the context is already done, `Acquire()` may still succeed without blocking. This introduces
-			// unpredictability in my test cases, so in order to avoid that, I am adding this cancellation check.
 			select {
+			// Skip work if the context has been cancelled (e.g. another goroutine failed).
 			case <-ctxCancel.Done():
 				log.Error(fmt.Sprintf("Context canceled or timed out during %q backup", fe.Name))
-				bh.RecordError(name, vterrors.Errorf(vtrpcpb.Code_CANCELED, "context canceled"))
+				bh.RecordError(wi.name, vterrors.Errorf(vtrpcpb.Code_CANCELED, "context canceled"))
 				return nil
 			default:
 			}

-			// Backup the individual file.
-			var errBackupFile error
-			if errBackupFile = be.backupFile(ctxCancel, params, bh, fe, name); errBackupFile != nil {
-				bh.RecordError(name, vterrors.Wrapf(errBackupFile, "failed to backup file '%s'", name))
+			if errBackupFile := be.backupFile(ctxCancel, params, bh, fe, wi.name, wi.chunkIndex); errBackupFile != nil {
+				bh.RecordError(wi.name, vterrors.Wrapf(errBackupFile, "failed to backup '%s'", wi.name))


same here, this seems pre 1.22: https://go.dev/doc/go1.22#language

 	if files := bh.GetFailedFiles(); len(files) > 0 {
 		newFEs := make([]FileEntry, len(fes))
 		for _, file := range files {
-			fileNb, err := strconv.Atoi(file)
-			if err != nil {
-				return "", vterrors.Wrapf(err, "failed to retry file '%s'", file)
+			feIdx, chunkIdx, parseErr := parseBackupName(file)
+			if parseErr != nil {
+				return "", parseErr
 			}
-			oldFes := fes[fileNb]
-			newFEs[fileNb] = FileEntry{
-				Base:       oldFes.Base,
-				Name:       oldFes.Name,
-				ParentPath: oldFes.ParentPath,
-				Hash:       oldFes.Hash,
-				RetryCount: 1,
+			oldFe := fes[feIdx]
+			if newFEs[feIdx].Name == "" {
+				newFEs[feIdx] = FileEntry{
+					Base:       oldFe.Base,
+					Name:       oldFe.Name,
+					ParentPath: oldFe.ParentPath,
+					Hash:       oldFe.Hash,
+					RetryCount: 1,
+				}
+			}
+			if chunkIdx >= 0 {
+				newFEs[feIdx].Chunks = append(newFEs[feIdx].Chunks, oldFe.Chunks[chunkIdx])
 			}
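
The retry hunk above depends on mapping a storage name back to a file index and an optional chunk index. The real `parseBackupName` is not shown in this thread, so the following is only a guess at its shape, under the assumption that whole files are named `"<fileIndex>"`, chunks are named `"<fileIndex>-<chunkIndex>"` (matching the `%d-%d` format used for `StorageName`), and a chunk index of -1 means "not chunked". `parseName` is a hypothetical name.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseName splits a storage name into a file index and a chunk index.
// "3" yields (3, -1); "3-7" yields (3, 7). Hypothetical sketch only.
func parseName(name string) (feIdx, chunkIdx int, err error) {
	filePart, chunkPart, chunked := strings.Cut(name, "-")
	feIdx, err = strconv.Atoi(filePart)
	if err != nil || feIdx < 0 {
		return 0, 0, fmt.Errorf("invalid backup name %q", name)
	}
	if !chunked {
		return feIdx, -1, nil
	}
	chunkIdx, err = strconv.Atoi(chunkPart)
	if err != nil || chunkIdx < 0 {
		return 0, 0, fmt.Errorf("invalid chunk index in backup name %q", name)
	}
	return feIdx, chunkIdx, nil
}

func main() {
	for _, name := range []string{"3", "3-7", "abc", "3-"} {
		fe, chunk, err := parseName(name)
		fmt.Println(name, fe, chunk, err)
	}
}
```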


promptless · 2026-05-25T13:09:27Z

Promptless prepared a documentation update related to this change.

Triggered by PR #20167

Added documentation for the new --builtinbackup-file-chunk-threshold and --builtinbackup-file-chunk-size flags to the backup and restore overview guide. These flags enable parallel backup and restore of large files by splitting them into independently-compressed chunks.

Review: Document builtin backup chunking flags

Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

+			fullPath, pathErr := fe.fullPath(params.Cnf)
+			if pathErr != nil {
+				return vterrors.Wrapf(pathErr, "can't get path for chunked file %v", fe.Name)
+			}
+			dest, openErr := os.OpenFile(fullPath, os.O_WRONLY, 0o644)
+			if openErr != nil {
+				return vterrors.Wrapf(openErr, "can't open destination for chunked file %v", fe.Name)
+			}


codecov · 2026-05-25T13:19:57Z

Codecov Report

❌ Patch coverage is 77.95276% with 56 lines in your changes missing coverage. Please review.
✅ Project coverage is 52.78%. Comparing base (70c7a72) to head (22620c5).
⚠️ Report is 277 commits behind head on main.

Files with missing lines	Patch %	Lines
go/vt/mysqlctl/builtinbackupengine.go	77.95%	56 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (70c7a72) and HEAD (22620c5): HEAD has 1 upload less than BASE.

Flag	BASE (70c7a72)	HEAD (22620c5)
	1	0

Additional details and impacted files

@@             Coverage Diff             @@
##             main   #20167       +/-   ##
===========================================
- Coverage   69.67%   52.78%   -16.89%     
===========================================
  Files        1614       46     -1568     
  Lines      216793     7290   -209503     
===========================================
- Hits       151044     3848   -147196     
+ Misses      65749     3442    -62307

Flag	Coverage Δ
partial	`52.78% <77.95%> (?)`


frouioui · 2026-05-26T14:32:43Z

+type FileChunk struct {
+	StorageName string
+	Offset      int64
+	Size        int64
+	Hash        string
+}


Let's add comments for each field, similar to the FileEntry struct

comments added :)

Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.

+func computeFileChunks(fileIndex int, fileSize, chunkSize int64) []FileChunk {
+	numChunks := (fileSize + chunkSize - 1) / chunkSize
+	chunks := make([]FileChunk, numChunks)
+	for j := range numChunks {
+		offset := j * chunkSize
+		size := chunkSize
+		if offset+size > fileSize {
+			size = fileSize - offset
+		}
+		chunks[j] = FileChunk{
+			StorageName: fmt.Sprintf("%d-%d", fileIndex, j),
+			Offset:      offset,
+			Size:        size,
+		}
+	}
+	return chunks
+}


+	for _, wi := range workItems {
 		g.Go(func() error {
-			fe := &fes[i]
-			name := strconv.Itoa(i)
+			fe := &fes[wi.feIndex]

-			// Check for context cancellation explicitly because, the way semaphore code is written, theoretically we might
-			// end up not throwing an error even after cancellation. Please see https://cs.opensource.google/go/x/sync/+/refs/tags/v0.1.0:semaphore/semaphore.go;l=66,
-			// which suggests that if the context is already done, `Acquire()` may still succeed without blocking. This introduces
-			// unpredictability in my test cases, so in order to avoid that, I am adding this cancellation check.
 			select {
+			// Skip work if the context has been cancelled (e.g. another goroutine failed).
 			case <-ctxCancel.Done():
 				log.Error(fmt.Sprintf("Context canceled or timed out during %q backup", fe.Name))
-				bh.RecordError(name, vterrors.Errorf(vtrpcpb.Code_CANCELED, "context canceled"))
+				bh.RecordError(wi.name, vterrors.Errorf(vtrpcpb.Code_CANCELED, "context canceled"))
 				return nil
 			default:
 			}

-			// Backup the individual file.
-			var errBackupFile error
-			if errBackupFile = be.backupFile(ctxCancel, params, bh, fe, name); errBackupFile != nil {
-				bh.RecordError(name, vterrors.Wrapf(errBackupFile, "failed to backup file '%s'", name))
+			if errBackupFile := be.backupFile(ctxCancel, params, bh, fe, wi.name, wi.chunkIndex); errBackupFile != nil {
+				bh.RecordError(wi.name, vterrors.Wrapf(errBackupFile, "failed to backup '%s'", wi.name))


+			for j := range fe.Chunks {
+				g.Go(func() error {
+					chunk := &fe.Chunks[j]
+
+					select {
+					// Skip work if the context has been cancelled (e.g. another goroutine failed).
+					case <-ctx.Done():
+						log.Error(fmt.Sprintf("Context canceled or timed out during %q chunk %d restore", fe.Name, j))
+						bh.RecordError(chunk.StorageName, vterrors.Errorf(vtrpcpb.Code_CANCELED, "context canceled"))
+						return nil
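
The `select` with an empty `default` case used in these hunks is a non-blocking check of the context: if `ctx.Done()` is already closed the cancellation branch runs, otherwise execution falls straight through to the work. A minimal standalone sketch (`cancelled` and `demoCancel` are hypothetical names):

```go
package main

import (
	"context"
	"fmt"
)

// cancelled reports whether ctx is already done, without ever blocking.
func cancelled(ctx context.Context) bool {
	select {
	case <-ctx.Done():
		return true
	default:
		return false
	}
}

// demoCancel checks a context before and after it is cancelled.
func demoCancel() (before, after bool) {
	ctx, cancel := context.WithCancel(context.Background())
	before = cancelled(ctx)
	cancel()
	after = cancelled(ctx)
	return before, after
}

func main() {
	before, after := demoCancel()
	fmt.Println(before, after)
}
```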


+	if backupFileChunkSize <= 0 {
+		return BackupUnusable, vterrors.Errorf(vtrpcpb.Code_FAILED_PRECONDITION, "builtinbackup-file-chunk-size can't be zero")
+	}


+	cleanup := func() error {
+		params.Logger.Infof("closing decompressor")
+		closeAt := time.Now()
+		cerr := closeWithRetry(ctx, params.Logger, closer, "decompressor")
+		if cerr != nil {
+			cerr = vterrors.Wrapf(cerr, "failed to close decompressor %v", name)
+			params.Logger.Error(cerr)
+		}
+		params.Stats.Scope(stats.Operation("Decompressor:Close")).TimedIncrement(time.Since(closeAt))
+		return cerr
+	}


mattlord

go/vt/mysqlctl/builtinbackupengine.go:1386-1390 ignores final close errors for chunked restore destinations. Non-chunked restore propagates destination close failures because they can mean data was not safely flushed; chunked restore only logs them and can report success, then later attempt to start MySQL on an incomplete/corrupt file. Please collect these close errors and return them, and also check the dest.Close() in createChunkedDestinations at line 1367.
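
One way the close errors could be collected and returned, sketched with the standard library's `errors.Join`. This is an illustration of the request above, not the change that was made in the PR; `closeAll`, `fakeCloser`, and `demoClose` are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// closeAll closes every destination and returns all failures joined into a
// single error. errors.Join returns nil when there is nothing to report.
func closeAll(dests map[string]io.Closer) error {
	var errs []error
	for name, d := range dests {
		if err := d.Close(); err != nil {
			errs = append(errs, fmt.Errorf("failed to close destination %q: %w", name, err))
		}
	}
	return errors.Join(errs...)
}

// fakeCloser is a test double whose Close returns a fixed error.
type fakeCloser struct{ err error }

func (f fakeCloser) Close() error { return f.err }

// demoClose returns the result for an all-successful set and for a set in
// which one destination fails to close.
func demoClose() (allOK, oneFailed error) {
	allOK = closeAll(map[string]io.Closer{"a": fakeCloser{}, "b": fakeCloser{}})
	oneFailed = closeAll(map[string]io.Closer{
		"a": fakeCloser{},
		"b": fakeCloser{err: errors.New("disk full")},
	})
	return allOK, oneFailed
}

func main() {
	allOK, oneFailed := demoClose()
	fmt.Println(allOK)
	fmt.Println(oneFailed)
}
```

Returning the joined error lets the caller fail the restore instead of starting MySQL on a file that may not have been fully flushed.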

go/vt/mysqlctl/builtinbackupengine.go:196-198 / :691-694 has no bound on the chunk count. A small --builtinbackup-file-chunk-size typo, e.g. 1, can allocate one FileChunk and one work item per byte of a large InnoDB file before backup starts. Please enforce a sane minimum chunk size or a max chunks-per-file limit before allocating.

go/vt/mysqlctl/builtinbackupengine.go:281-283 validates chunk size even when chunking is disabled or when taking an incremental backup that may not use chunking. I’d validate chunk-size > 0 only when chunk-threshold > 0, reject negative thresholds explicitly, and fix the message to say must be > 0.
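
The two validation comments above could be combined along these lines. This is a sketch of the suggested rules only: `validateChunkFlags` and the 1 MiB `minChunkSize` floor are assumptions made for illustration, not values taken from the PR.

```go
package main

import (
	"errors"
	"fmt"
)

// minChunkSize is a hypothetical lower bound that keeps a typo such as a
// chunk size of 1 from producing one chunk per byte.
const minChunkSize = int64(1) << 20 // 1 MiB

// validateChunkFlags checks the chunk size only when chunking is enabled.
func validateChunkFlags(threshold, chunkSize int64) error {
	if threshold < 0 {
		return errors.New("builtinbackup-file-chunk-threshold must be >= 0")
	}
	if threshold == 0 {
		return nil // chunking is disabled, so the chunk size is irrelevant
	}
	if chunkSize <= 0 {
		return errors.New("builtinbackup-file-chunk-size must be > 0")
	}
	if chunkSize < minChunkSize {
		return fmt.Errorf("builtinbackup-file-chunk-size must be at least %d bytes", minChunkSize)
	}
	return nil
}

func main() {
	fmt.Println(validateChunkFlags(0, 0))         // chunking disabled
	fmt.Println(validateChunkFlags(1<<30, 1))     // chunk size too small
	fmt.Println(validateChunkFlags(1<<30, 1<<30)) // valid configuration
}
```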

I agree with the compatibility caveat too: once chunking is enabled, those backups are not restorable by older Vitess versions because old restore code ignores Chunks and looks for whole-file objects. That should be called out in release notes summary.

rvrangel · 2026-05-27T18:54:57Z

thanks @mattlord, I have updated the PR! let me know if you spot any other issues or want me to make any changes.

chunked builtin backup engine

3c46fbe

Signed-off-by: Renan Rangel <rrangel@slack-corp.com>

Copilot AI review requested due to automatic review settings May 22, 2026 18:28

github-actions Bot added this to the v25.0.0 milestone May 22, 2026

Copilot started reviewing on behalf of rvrangel May 22, 2026 18:29

Copilot AI reviewed May 22, 2026

small refactor of how files are opened

b3e9314

Signed-off-by: Renan Rangel <rrangel@slack-corp.com>

rvrangel marked this pull request as ready for review May 25, 2026 13:06

rvrangel requested a review from mattlord as a code owner May 25, 2026 13:06

Copilot AI review requested due to automatic review settings May 25, 2026 13:06

rvrangel requested a review from frouioui as a code owner May 25, 2026 13:06

Copilot started reviewing on behalf of rvrangel May 25, 2026 13:07

Copilot AI reviewed May 25, 2026

linter and other improvements

dd8bb8c

Signed-off-by: Renan Rangel <rrangel@slack-corp.com>

frouioui reviewed May 26, 2026

add some comments after PR feedback

22620c5

Signed-off-by: Renan Rangel <rrangel@slack-corp.com>

Copilot AI review requested due to automatic review settings May 26, 2026 15:39

Copilot AI reviewed May 26, 2026

mattlord reviewed May 26, 2026

PR feedback

1546232

Signed-off-by: Renan Rangel <rrangel@slack-corp.com>

rvrangel wants to merge 5 commits into vitessio:main from rvrangel:builtin-backup-chunking

4 participants

vitess-bot Bot commented May 22, 2026

Review Checklist

General

Tests

Documentation

New flags

If a workflow is added or modified:

Backward compatibility
