Fix restore from backup execution path to use context from caller #12828

Merged
deepthi merged 14 commits into vitessio:main from planetscale:bugfixInPR12703
Apr 20, 2023
Conversation

@rsajwani

@rsajwani rsajwani commented Apr 4, 2023

Contributor

Description

This PR mainly fixes a bug I identified while working on PR12703. VtctldClient calls 'RestoreFromBackup' with a context whose timeout is set by the action_timeout flag. However, in executeRestoreFullBackup and executeRestoreIncrementalBackup we pass context.Background() down instead of the original context. As a result, the restore keeps running in the background even after the client context is canceled or times out. To fix this, we now pass the same context object through.

Reference file: https://github.com/vitessio/vitess/blob/main/go/vt/mysqlctl/builtinbackupengine.go

There are a few more fixes in this PR:

1- Fixed a typo in the spelling of "Canceled".
2- Previously, during restore, if the context got canceled, the error was reported to backupHandler instead of the rec structure. This could result in the restore not returning any error to its caller.
3- Errors are now reported back to the client in case of failure during backup or restore.

NOTE

During restore, if the context times out or is canceled, the tablet type remains 'RESTORE' and does not revert to its original state. The reason is that before the restore starts, all existing files are deleted. If the restore is then aborted midway, the MySQL instance won't be able to restart with an incomplete set of restored files. This causes ChangeTabletType to fail.

Related Issue(s)

closes #12830.

Checklist

  • "Backport to:" labels have been added if this change should be back-ported
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on the CI
  • Documentation was added or is not required

Deployment Notes

Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com>
@vitess-bot vitess-bot Bot added the NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work label Apr 4, 2023
@vitess-bot

vitess-bot Bot commented Apr 4, 2023

Contributor

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • If this is a change that users need to know about, please apply the release notes (needs details) label so that merging is blocked unless the summary release notes document is included.
  • If a test is added or modified, there should be documentation on top of the test to explain what the expected behavior is and what the test does.

If a new flag is being introduced:

  • Is it really necessary to add this flag?
  • Flag names should be clear and intuitive (as far as possible)
  • Help text should be descriptive.
  • Flag names should use dashes (-) as word separators rather than underscores (_).

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow should be required, the maintainer team should be notified.

Bug fixes

  • There should be at least one unit or end-to-end test.
  • The Pull Request description should include a link to an issue that describes the bug.

Non-trivial changes

  • There should be some code comments as to why things are implemented the way they are.

New/Existing features

  • Should be documented, either by modifying the existing documentation or creating new documentation.
  • New features should have a link to a feature request issue or an RFC that documents the use cases, corner cases and test cases.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • vtctl command output order should be stable and awk-able.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from VTop, if used there.

@vitess-bot vitess-bot Bot added the NeedsWebsiteDocsUpdate What it says label Apr 4, 2023
@github-actions github-actions Bot added this to the v17.0.0 milestone Apr 4, 2023
Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com>
Comment thread go/vt/mysqlctl/builtinbackupengine.go
@rsajwani rsajwani added Type: Bug Type: Enhancement Logical improvement (somewhere between a bug and feature) Component: Backup and Restore and removed NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsWebsiteDocsUpdate What it says labels Apr 5, 2023
})
require.NoError(t, err)

// Set up tm client

Contributor Author

I think we don't need this.

@rsajwani rsajwani marked this pull request as ready for review April 5, 2023 12:39
@rsajwani rsajwani changed the title Fix error reporting during restore while context concelled. Fix error reporting during restore while context cancelled. Apr 5, 2023
@rsajwani

rsajwani commented Apr 5, 2023

Contributor Author

Spoke offline with @mattlord. He suggests fixing #12830 in this same PR. I will update the PR with additional changes once I have researched #12830.

rsajwani added 5 commits April 6, 2023 14:53
Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com>
Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com>
Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com>
Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com>
Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com>
@rsajwani rsajwani changed the title Fix error reporting during restore while context cancelled. Fix restore from backup execution path to use context from caller Apr 11, 2023
@rsajwani

Contributor Author

> Spoke offline with @mattlord. He suggests fixing #12830 in this same PR. I will update the PR with additional changes once I have researched #12830.

done

Comment thread go/vt/mysqlctl/builtinbackupengine_test.go Outdated
Comment thread go/vt/mysqlctl/builtinbackupengine_test.go Outdated
@deepthi deepthi added the release notes (needs details) This PR needs to be listed in the release notes in a dedicated section (deprecation notice, etc...) label Apr 11, 2023
Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com>
Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com>
Comment thread changelog/17.0/17.0.0/summary.md Outdated
Prior to v17, this asynchronous process could run indefinitely in the background since it was called using an empty background. In v17 [PR#12830](https://github.com/vitessio/vitess/issues/12830),
this behavior was changed to use the same context with which the client called the RestoreFromBackup command, which uses action_timeout to wait for any command to finish.
If you are using VtctldClient to initiate a restore, make sure you provide an appropriate value for action_timeout to give enough time for the restore process to complete.
Otherwise, the restore will throw an error if the context expires before it completes.

Contributor

What happens to the tablet state? Does it go back to its previous tablet_type?

Contributor Author

Restore is tricky. The tablet does not go back to its original state, and it keeps the sentinel file, which indicates that the last restore failed. Since we cancel in the middle of the copy, depending on where we left off, the restore gives an error while calling ChangeTabletType()->updateLocked->ts.canServe.

Contributor Author

Part of it is due to the fact that during restore we first delete all the existing files:
logger.Infof("Restore: deleting existing files")
if err := removeExistingFiles(cnf); err != nil {
	return err
}
This prevents us from going back to the original state.

Contributor Author

So I verified this again. In the topo we go back to the 'REPLICA' state, but VTAdmin (which calls show vitess_tablets) will show RESTORE. The reason, as stated above, is that MySQL will be down due to the incomplete set of files.

@deepthi deepthi Apr 20, 2023

Contributor

If this is really the behavior then there are two issues.

  1. healthcheck: when the tablet goes back to REPLICA, it should be sending a healthcheck update to vtgate which updates its cache, which is what is used by show vitess_tablets.
  2. We had agreed that if a RESTORE fails the tablet should not go back to REPLICA because there is a chance of it serving stale data.
    Can you file a separate issue for this?

@rsajwani rsajwani Apr 20, 2023

Contributor Author

For number 2: since the restore fails and we stop MySQL, the .sock file is lost, so we are not able to connect or make any changes to the db. It won't serve data because the tabletserver won't function. The topo is refreshed to the replica value (which is why vtctlclient gives you 'replica'), and, as I mentioned, show vitess_tablets continues to show RESTORE.

Contributor Author

bug filed
#12945

Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com>
Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com>
 func (be *BuiltinBackupEngine) executeRestoreIncrementalBackup(ctx context.Context, params RestoreParams, bh backupstorage.BackupHandle, bm builtinBackupManifest) error {
 	params.Logger.Infof("Restoring incremental backup to position: %v", bm.Position)
-	createdDir, err := be.restoreFiles(context.Background(), params, bh, bm)
+	createdDir, err := be.restoreFiles(ctx, params, bh, bm)

Contributor Author

this is the bug fix.

Comment thread go/vt/mysqlctl/backup.go Outdated
return nil, err
}

// if we got restore error then after restarting MYSQL we should return back with error

Contributor

Does the test case validate that mysqld is indeed up after a timed-out restore?

@rsajwani rsajwani self-assigned this Apr 17, 2023
Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com>
Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com>
@rsajwani rsajwani removed the release notes (needs details) This PR needs to be listed in the release notes in a dedicated section (deprecation notice, etc...) label Apr 19, 2023

@shlomi-noach shlomi-noach left a comment

Contributor

Looks good, nice catch! There's one change I don't understand, please see inline comment.

+	bgCtx := context.Background()
 	// If anything failed, we should reset the original tablet type
-	if err := tm.tmState.ChangeTabletType(ctx, originalType, DBActionNone); err != nil {
+	if err := tm.tmState.ChangeTabletType(bgCtx, originalType, DBActionNone); err != nil {

Contributor

I don't understand this change here. It looks like the opposite of the bugfix. Why would we not reuse ctx?

Contributor Author

We need to set the status back to 'REPLICA' (or whatever it was prior to the backup/restore) in all cases. If we use 'ctx', then in case it has been canceled or has timed out, we won't be able to set the status back to its original value (#12701).

Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com>
@deepthi

deepthi commented Apr 20, 2023

Contributor

I'm going to merge this now, but it looks like we may need another follow-up PR for the tablet type change.

@deepthi deepthi merged commit a836318 into vitessio:main Apr 20, 2023
@deepthi deepthi deleted the bugfixInPR12703 branch April 20, 2023 22:08
Labels

Component: Backup and Restore Type: Bug Type: Enhancement Logical improvement (somewhere between a bug and feature)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

executeRestoreFullBackup should use context from caller to pass down to 'restoreFiles'

3 participants