Fix VTOrc Discovery to also retry discovering tablets which aren't present in database_instance table #10662
frouioui left a comment
Paired with @GuptaManan100 on this. Looks good to me.
…esent in database_instance table (vitessio#10662) (vitessio#829)
* test: add failing test for durability policy setting after vtorc starts
* feat: add query to also check for vttablets whose information is absent in the database_instance table
* test: rename test
Signed-off-by: Manan Gupta <manan@planetscale.com>
@GuptaManan100 @deepthi This fix was unfortunately not included in 14.0.2, released 11 days ago. We are periodically affected by this issue in our large cluster. Every day, about 5-10 vttablets are rescheduled by Kubernetes as part of normal hardware maintenance. In almost every case, the restarted vttablet does not immediately respond to the vtorc discovery query, which causes vtorc to forget the instance indefinitely. As a result, the vtorc audit recovery log reports that the restarted instance is broken (e.g. ReplicationStopped), and vtorc never clears that status even though the instance is completely healthy. User impact:
Can we please prioritize the next 14.x release to include this fix?
@vkovacik We have decided to backport this fix, and it will be available in the next patch release of v14.
…esent in database_instance table (#10662) (#11238)
* test: add failing test for durability policy setting after vtorc starts
* feat: add query to also check for vttablets whose information is absent in the database_instance table
* test: rename test
Signed-off-by: Manan Gupta <manan@planetscale.com>
Description
This PR adds an end-to-end test reproducing the linked issue: if the durability policy is set after VTOrc starts, VTOrc cannot pick up the change and never discovers the vttablets, no matter how much time elapses.
The problem was found in the loop where we refresh the MySQL information for vttablets. There we were only looking for tablets that were outdated, i.e. tablets whose last valid check was older than the configured time.
If the durability policy doesn't exist when the tablets are first discovered, we do not read the MySQL information and exit early, so the tablet never appears in the database_instance table at all.
Therefore, the fix also checks for these tablets in the refresh code. They are the tablets that are present in the vitess_tablet table but not in the database_instance table. This operation is an anti-join between the two tables, which can be written in MySQL as a left outer join with a predicate that selects only the rows without a matching counterpart in the database_instance table.
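As a rough illustration of the anti-join described in the Description, here is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. It is not the query from this PR: only the table names vitess_tablet and database_instance come from the description, while the hostname and port join columns, the helper names find_undiscovered_tablets and build_demo_db, and the sample rows are assumptions made for the example.

```python
import sqlite3


def find_undiscovered_tablets(conn):
    """Return (hostname, port) for tablets that have no database_instance row.

    Anti-join written as LEFT OUTER JOIN plus an IS NULL predicate.
    The join columns (hostname, port) are assumed for illustration.
    """
    query = """
        SELECT vt.hostname, vt.port
        FROM vitess_tablet AS vt
        LEFT OUTER JOIN database_instance AS di
            ON vt.hostname = di.hostname AND vt.port = di.port
        WHERE di.hostname IS NULL
        ORDER BY vt.hostname, vt.port
    """
    return conn.execute(query).fetchall()


def build_demo_db():
    """Create a toy schema with three known tablets, one already discovered."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE vitess_tablet (
            hostname TEXT NOT NULL,
            port INTEGER NOT NULL,
            PRIMARY KEY (hostname, port)
        );
        CREATE TABLE database_instance (
            hostname TEXT NOT NULL,
            port INTEGER NOT NULL,
            PRIMARY KEY (hostname, port)
        );
        INSERT INTO vitess_tablet VALUES
            ('tablet-a', 3306), ('tablet-b', 3306), ('tablet-c', 3306);
        INSERT INTO database_instance VALUES ('tablet-a', 3306);
        """
    )
    return conn


if __name__ == "__main__":
    demo = build_demo_db()
    # Tablets listed here would be retried by the refresh loop.
    for hostname, port in find_undiscovered_tablets(demo):
        print(f"{hostname}:{port}")
```

Because the left outer join fills the database_instance columns with NULL wherever no row matches, filtering on a NULL join column keeps exactly the tablets that were never recorded, which is the set the refresh loop previously skipped.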