Fix VTOrc Discovery to also retry discovering tablets which aren't present in database_instance table#10662

Merged
GuptaManan100 merged 3 commits into
vitessio:mainfrom
planetscale:vtorc-discovery-bug
Jul 12, 2022

@GuptaManan100

@GuptaManan100 GuptaManan100 commented Jul 11, 2022

Contributor

Description

This PR adds an end-to-end test reproducing the linked issue: if the durability policy is set after VTOrc startup, VTOrc cannot pick up the change and does not discover the vttablets at all, no matter how much time elapses.

The problem was found in the loop where we refresh the MySQL information for vttablets. There we were only looking for tablets that were outdated, i.e. whose last valid check was older than the configured time.
If the durability policy doesn't exist when the tablets are first discovered, we do not read the MySQL information and exit early, so the tablet is never written to the database_instance table at all. Such a tablet has no last valid check, so the outdated-tablet lookup never picks it up.
Therefore, the fix is to also check for these tablets in the refresh code. They are the ones that are present in the vitess_tablet table but not in the database_instance table. This operation is an anti-join between the two tables, which can be written in MySQL as a left outer join with a predicate that selects only the rows that have no matching counterpart in the database_instance table.
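
To illustrate the anti-join described above, here is a minimal sketch, not the query from this PR. It assumes both tables can be matched on a (hostname, port) pair; those column names, the reduced table definitions, and the helper names `find_undiscovered_tablets` and `build_demo_db` are assumptions made for the example. It runs against an in-memory SQLite database as a stand-in for VTOrc's backend; the same `LEFT JOIN ... WHERE ... IS NULL` form is valid in MySQL.

```python
import sqlite3


def find_undiscovered_tablets(conn):
    """Return (hostname, port) for tablets with no row in database_instance.

    The LEFT JOIN keeps every vitess_tablet row; where no matching
    database_instance row exists, the right-hand columns are NULL, so the
    IS NULL predicate keeps exactly the unmatched tablets (an anti-join).
    """
    return conn.execute(
        """
        SELECT vt.hostname, vt.port
        FROM vitess_tablet AS vt
        LEFT JOIN database_instance AS di
          ON vt.hostname = di.hostname AND vt.port = di.port
        WHERE di.hostname IS NULL
        ORDER BY vt.hostname, vt.port
        """
    ).fetchall()


def build_demo_db():
    """Create hypothetical, heavily reduced versions of the two tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE vitess_tablet (
            hostname TEXT NOT NULL,
            port INTEGER NOT NULL,
            PRIMARY KEY (hostname, port)
        );
        CREATE TABLE database_instance (
            hostname TEXT NOT NULL,
            port INTEGER NOT NULL,
            PRIMARY KEY (hostname, port)
        );
        -- Three tablets are known to the topology...
        INSERT INTO vitess_tablet VALUES
            ('tablet-a', 3306), ('tablet-b', 3306), ('tablet-c', 3306);
        -- ...but only one had its MySQL information recorded.
        INSERT INTO database_instance VALUES ('tablet-a', 3306);
        """
    )
    return conn


if __name__ == "__main__":
    conn = build_demo_db()
    print(find_undiscovered_tablets(conn))
    # prints [('tablet-b', 3306), ('tablet-c', 3306)]
```

In this sketch, the refresh loop would retry discovery for the returned tablets in addition to the outdated ones; once a tablet gets a row in database_instance, it drops out of the result.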

Related Issue(s)

Checklist

  • "Backport me!" label has been added if this change should be backported
  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

Signed-off-by: Manan Gupta <manan@planetscale.com>
…nt in the database_instance table

Signed-off-by: Manan Gupta <manan@planetscale.com>
@vitess-bot

vitess-bot Bot commented Jul 11, 2022

Contributor

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • If this is a change that users need to know about, please apply the release notes (needs details) label so that merging is blocked unless the summary release notes document is included.
  • If a new flag is being introduced, review whether it is really needed. The flag names should be clear and intuitive (as far as possible), and the flag's help should be descriptive.
  • If a workflow is added or modified, each item in Jobs should be named in order to mark it as required. If the workflow should be required, the GitHub Admin should be notified.

Bug fixes

  • There should be at least one unit or end-to-end test.
  • The Pull Request description should either include a link to an issue that describes the bug OR an actual description of the bug and how to reproduce, along with a description of the fix.

Non-trivial changes

  • There should be some code comments as to why things are implemented the way they are.

New/Existing features

  • Should be documented, either by modifying the existing documentation or creating new documentation.
  • New features should have a link to a feature request issue or an RFC that documents the use cases, corner cases and test cases.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • vtctl command output order should be stable and awk-able.

@GuptaManan100 GuptaManan100 added Type: Bug Component: VTOrc Vitess Orchestrator integration labels Jul 11, 2022
@GuptaManan100 GuptaManan100 changed the title test: add failing test for durability policy setting after vtorc starts Fix VTOrc Discovery to also retry discovering tablets which aren't present in database_instance table Jul 11, 2022
@GuptaManan100 GuptaManan100 marked this pull request as ready for review July 11, 2022 09:38

@frouioui frouioui left a comment

Member

Paired with @GuptaManan100 on this. Looks good to me.

Comment thread go/test/endtoend/vtorc/general/vtorc_test.go Outdated
Signed-off-by: Manan Gupta <manan@planetscale.com>
@GuptaManan100 GuptaManan100 merged commit cb47ca6 into vitessio:main Jul 12, 2022
@GuptaManan100 GuptaManan100 deleted the vtorc-discovery-bug branch July 12, 2022 08:19
rsajwani pushed a commit to planetscale/vitess that referenced this pull request Aug 1, 2022
…esent in database_instance table (vitessio#10662) (vitessio#829)

* test: add failing test for durability policy setting after vtorc starts

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: add query to also check for vttablets whose information is absent in the database_instance table

Signed-off-by: Manan Gupta <manan@planetscale.com>

* test: rename test

Signed-off-by: Manan Gupta <manan@planetscale.com>
@vkovacik


@GuptaManan100 @deepthi This fix has unfortunately not been merged into 14.0.2 released 11 days ago.

We have been periodically affected by this issue in our large cluster. Every day about 5-10 vttablets are rescheduled by Kubernetes as part of normal hardware maintenance operations. In almost every case, the restarted vttablet does not immediately start responding to the vtorc discovery query, which causes vtorc to forget the instance indefinitely.

As a result, the vtorc audit recovery log reports that the restarted instance is broken (e.g. ReplicationStopped), but vtorc never clears the status even though the instance is completely healthy.

User impact:

  • vtorc stops managing these restarted instances potentially resulting in shard breaking in future
  • We need to restart all the affected vtorc pods every day to clear out alerting on the vtorc's audit log.

Can we please prioritize the next 14.x release to include this fix?

@GuptaManan100

Contributor Author

@vkovacik We have decided to backport this fix, and it will be available in the next patch release of v14.

GuptaManan100 added a commit to planetscale/vitess that referenced this pull request Sep 16, 2022
…esent in database_instance table (vitessio#10662)

* test: add failing test for durability policy setting after vtorc starts

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: add query to also check for vttablets whose information is absent in the database_instance table

Signed-off-by: Manan Gupta <manan@planetscale.com>

* test: rename test

Signed-off-by: Manan Gupta <manan@planetscale.com>
GuptaManan100 added a commit that referenced this pull request Sep 19, 2022
…esent in database_instance table (#10662) (#11238)

* test: add failing test for durability policy setting after vtorc starts

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: add query to also check for vttablets whose information is absent in the database_instance table

Signed-off-by: Manan Gupta <manan@planetscale.com>

* test: rename test

Signed-off-by: Manan Gupta <manan@planetscale.com>

Signed-off-by: Manan Gupta <manan@planetscale.com>

Labels

Component: VTOrc Vitess Orchestrator integration Type: Bug

Development

Successfully merging this pull request may close these issues.

Bug Report: VTOrc does not discover tablets if the durability policy is not set initially

4 participants