Fix restore from backup execution path to use context from caller #12828
Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com>
})
require.NoError(t, err)

// Set up tm client
I think we don't need this.
Prior to v17, this asynchronous process could run indefinitely in the background since it was called using an empty background context. In v17 [PR#12830](https://github.com/vitessio/vitess/issues/12830),
this behavior was changed to use the same context with which the client called the RestoreFromBackup command, which uses action_timeout to wait for any command to finish.
If you are using VtctldClient to initiate a restore, make sure you provide an appropriate value for action_timeout to give enough time for the restore process to complete.
Otherwise, the restore will throw an error if the context expires before it completes.
What happens to the tablet state? Does it go back to its previous tablet_type?
Restore is tricky. It does not go back to the original state, and it keeps the sentinel file, which indicates that the last restore failed. Since we cancel in the middle of the copy, depending on where we left off, it gives an error during the restore while calling ChangeTabletType()->updateLocked->ts.canServe
Part of it is due to the fact that during restore we first delete all the files.
logger.Infof("Restore: deleting existing files")
if err := removeExistingFiles(cnf); err != nil {
	return err
}
This prevents us from coming back to the original state.
So I verified this again. In topo we go back to the 'REPLICA' state, but VTAdmin (which calls show vitess_tablets) will show RESTORE. The reason, as stated above, is that MySQL will be down due to the incomplete set of files.
If this is really the behavior then there are two issues.
- healthcheck: when the tablet goes back to REPLICA, it should be sending a healthcheck update to vtgate which updates its cache, which is what is used by show vitess_tablets.
- We had agreed that if a RESTORE fails the tablet should not go back to REPLICA because there is a chance of it serving stale data.
Can you file a separate issue for this?
For number 2: since the restore fails and we stop MySQL, it loses its .sock file, so we aren't able to connect and make any change to the db. It therefore won't serve data, because tabletserver won't function. The topo is refreshed to the replica value (so vtctlclient gives you 'replica') and, as I mentioned, show vitess_tablets continues to show RESTORE.
func (be *BuiltinBackupEngine) executeRestoreIncrementalBackup(ctx context.Context, params RestoreParams, bh backupstorage.BackupHandle, bm builtinBackupManifest) error {
	params.Logger.Infof("Restoring incremental backup to position: %v", bm.Position)
-	createdDir, err := be.restoreFiles(context.Background(), params, bh, bm)
+	createdDir, err := be.restoreFiles(ctx, params, bh, bm)
This is the bug fix.
	return nil, err
}

// if we got restore error then after restarting MYSQL we should return back with error
Does the test case validate that mysqld is indeed up after a timed-out restore?
shlomi-noach left a comment
Looks good, nice catch! There's one change I don't understand, please see inline comment.
	bgCtx := context.Background()
	// If anything failed, we should reset the original tablet type
-	if err := tm.tmState.ChangeTabletType(ctx, originalType, DBActionNone); err != nil {
+	if err := tm.tmState.ChangeTabletType(bgCtx, originalType, DBActionNone); err != nil {
I don't understand this change here. It looks like the opposite of the bugfix. Why would we not reuse ctx?
We need to set the status back to 'REPLICA' (or whatever it was prior to the backup/restore) in all cases. If we use 'ctx', then in case it has been canceled or has timed out we won't be able to set the status back to its original value (#12701).
I'm going to merge this now, but it looks like we may need another follow-up PR for the tablet type change.
Description
Mainly this PR is fixing a bug I identified during PR #12703. VtctldClient calls 'RestoreFromBackup' with a context whose timeout is set using the action_timeout flag. But during executeRestoreFullBackup and executeRestoreIncrementalBackup, instead of passing the original context, we pass context.Background() to these methods. This results in the restore continuing to run in the background even if the client context gets Canceled or Timed-out. To fix this issue we pass on that same context object.
Reference file: https://github.com/vitessio/vitess/blob/main/go/vt/mysqlctl/builtinbackupengine.go
There are a few more fixes in this PR:
1- Fixed a typo in the spelling of Canceled.
2- During restore, if the context gets canceled, we report the error to backupHandler instead of the rec structure. This can result in restore not throwing any error to its caller.
3- Reporting errors back to the client in case of failure during backup or restore.
NOTE
During restore, if the context times out or is canceled, the tablet type remains 'RESTORE' and does not revert to its original state. The reason is that before we start the restore, all the existing files are deleted first. If the restore is then aborted in the middle, the MySQL instance won't be able to restart with an incomplete set of restored files. This results in ChangeTabletType failing.
Related Issue(s)
closes #12830.