RFC: Point in time recovery (PITR) Part 2 #6267

This is an extension to #4886.

## Feature Description

With the current PITR support in Vitess, a restore can only go as far as the timestamp of the last backup; changes made after that backup cannot be applied. For example, say the last backup was taken at 12:00 AM and we need to restore up to 3:15 AM. As of now, we can only restore the data as of 12:00 AM. With the change proposed here, it becomes possible to restore the data up to 3:15 AM.

## Use cases
These remain the same as in part 1 (#4886):
 - accidental deletion of data
 - corruption of data due to application bugs

## Precondition
- Backups should be taken regularly, so that we don’t have to replay all the binlogs from the start.
- Binlogs should be available up to the required point in time.
- All preconditions of the part-1 RFC should be met.

## Proposed Design

![Screenshot from 2020-06-04 12-00-09](https://user-images.githubusercontent.com/8344391/83722639-f96a9280-a65a-11ea-967f-bedfa0238043.png)

There will be a binlog server that connects to the MySQL server of the master tablet. In a sharded cluster with n shards, there will be n binlog servers.
Scheduled backups are available at regular intervals.
Say we have to recover the data up to 6:15 AM. We create a recovery keyspace from the 6 AM backup, and it connects to the binlog server to get the incremental data for the last 15 minutes.
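
The backup selection in the example above can be sketched as follows. This is only an illustration: `pick_backup` and the list of backup times are hypothetical names, not part of Vitess.

```python
from datetime import datetime
from typing import Optional, Sequence


def pick_backup(backup_times: Sequence[datetime],
                recovery_time: datetime) -> Optional[datetime]:
    """Return the most recent backup taken at or before recovery_time.

    Returns None when every backup is newer than the requested time,
    in which case there is nothing to restore from.
    """
    candidates = [t for t in backup_times if t <= recovery_time]
    return max(candidates) if candidates else None


# Backups every 6 hours; the requested recovery time is 6:15 AM.
backups = [
    datetime(2020, 6, 4, 0, 0),
    datetime(2020, 6, 4, 6, 0),
    datetime(2020, 6, 4, 12, 0),
]
print(pick_backup(backups, datetime(2020, 6, 4, 6, 15)))  # 2020-06-04 06:00:00
```

The binlog server then only has to supply the changes between the chosen backup and the requested time.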

### Binlog server
There should be a binlog server backed by a reliable file storage system. It should be highly available so that we don’t miss any binlogs. In a sharded environment, we need to run a separate binlog server for each shard. mysql-ripple can be used as the binlog server. The lifecycle of a binlog server has to be managed independently.

### Applying binlogs
While creating the recovery keyspace, we accept a timestamp. Using that information, we extract the GTID up to which the binlogs will be applied to the restored backup. The recovered replica then replicates from the binlog server to apply the binlogs needed to reach that GTID, using the MySQL replication command `START SLAVE UNTIL SQL_BEFORE_GTIDS = 'xxxx-xx-xx:y-z'`.

Note: we will choose the last GTID _before_ the provided recovery timestamp.
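
As a rough sketch of what the recovered replica would run, the helper below builds the statement sequence as strings. The function name, host and GTID value are made up for illustration. It also assumes that the binlog server accepts GTID auto-positioning and that replication credentials are configured elsewhere; neither is specified in this RFC.

```python
import re
from typing import List

# Accepts a single GTID or one interval, e.g. "<uuid>:23" or "<uuid>:1-23".
_GTID_RE = re.compile(
    r"^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}:\d+(-\d+)?$"
)
_HOST_RE = re.compile(r"^[A-Za-z0-9.-]+$")


def recovery_statements(host: str, port: int, stop_gtid: str) -> List[str]:
    """Build the SQL the recovered replica would execute, in order."""
    # Validate inputs because they are interpolated into SQL text.
    if not _HOST_RE.match(host):
        raise ValueError(f"unexpected host: {host!r}")
    if not _GTID_RE.match(stop_gtid):
        raise ValueError(f"unexpected GTID format: {stop_gtid!r}")
    return [
        "STOP SLAVE",
        # Assumption: GTID auto-positioning against the binlog server.
        f"CHANGE MASTER TO MASTER_HOST='{host}', MASTER_PORT={int(port)}, "
        "MASTER_AUTO_POSITION=1",
        f"START SLAVE UNTIL SQL_BEFORE_GTIDS = '{stop_gtid}'",
    ]


for stmt in recovery_statements(
    "binlog-server.example", 3306, "3e11fa47-71ca-11e1-9e33-c80aa9429562:23"
):
    print(stmt)
```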

### Getting GTID from timestamp

While creating the recovery keyspace, we are given the timestamp (#Ref) to restore up to. We also have the GTID position of the most recent backup taken before that time. For example, for a PITR to 6:15 AM with backups scheduled every 6 hours, the most recent backup is the one from 6 AM. We then connect to the binlog server as a replica, with the start position set to the GTID position of that backup, and read events sequentially for as long as the event timestamp is less than or equal to the requested timestamp (#Ref). The GTID of the last event that satisfies this condition is the one we note.
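
The sequential scan can be sketched as below. The event stream is modelled as an in-memory list rather than a real replication connection, and `GtidEvent` and `last_gtid_up_to` are hypothetical names. The sketch assumes events arrive in non-decreasing timestamp order.

```python
from typing import Iterable, NamedTuple, Optional


class GtidEvent(NamedTuple):
    gtid: str
    timestamp: int  # seconds since the Unix epoch


def last_gtid_up_to(events: Iterable[GtidEvent],
                    recovery_ts: int) -> Optional[str]:
    """Return the GTID of the last event with timestamp <= recovery_ts.

    Stops reading at the first event past the requested timestamp.
    Returns None if no event qualifies, i.e. the backup alone already
    covers the requested point in time.
    """
    last = None
    for event in events:
        if event.timestamp > recovery_ts:
            break
        last = event.gtid
    return last


# Events read from the binlog server, starting at the backup's GTID position.
stream = [
    GtidEvent("uuid:101", 1000),
    GtidEvent("uuid:102", 1500),
    GtidEvent("uuid:103", 2100),
]
print(last_gtid_up_to(stream, 2000))  # uuid:102
```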

### Getting the data till the desired point in time

At this point we have the following:
- Last available backup.
- GTID till which we need to replicate from the binlog (the incremental data)

First, we restore the last available backup. Then we connect to the binlog server as a replica using the `START SLAVE UNTIL SQL_BEFORE_GTIDS = 'xxxx-xx-xx:y-z'` option, which applies the incremental data up to the desired point in time.
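
Which GTID is passed to the command matters. As MySQL documents it, `SQL_BEFORE_GTIDS` stops the applier before the first transaction in the given set is applied, while `SQL_AFTER_GTIDS` stops after the last one. The toy model below illustrates the "before" behaviour on a list of GTIDs. It is a simulation with made-up names, not a replication client.

```python
from typing import List, Sequence


def applied_before(gtids_in_order: Sequence[str], stop_gtid: str) -> List[str]:
    """Model SQL_BEFORE_GTIDS: apply transactions until stop_gtid is next.

    The transaction identified by stop_gtid itself is NOT applied.
    """
    applied = []
    for gtid in gtids_in_order:
        if gtid == stop_gtid:
            break
        applied.append(gtid)
    return applied


stream = ["uuid:101", "uuid:102", "uuid:103"]
print(applied_before(stream, "uuid:103"))  # ['uuid:101', 'uuid:102']
```

In this model, restoring everything up to and including `uuid:102` means stopping before `uuid:103`.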

## FAQ

### New configuration
While restoring the tablet, you have to specify the binlog server details as command line arguments of the vttablet process.

If the keyspace has multiple shards, you need to run one binlog server per shard and, when recovering a particular shard (or shards), pass the details of the matching binlog server in the command line arguments.
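
One way to keep track of which binlog server belongs to which shard is a simple lookup table, sketched below. The host names and the helper are hypothetical, and the actual vttablet argument names are deliberately not shown because this RFC does not define them.

```python
from typing import Dict, Tuple

# Hypothetical mapping from shard name to (host, port) of its binlog server.
BINLOG_SERVERS: Dict[str, Tuple[str, int]] = {
    "-80": ("binlog-a.example", 3306),
    "80-": ("binlog-b.example", 3306),
}


def binlog_server_for(shard: str) -> Tuple[str, int]:
    """Return the binlog server address to use when recovering a shard."""
    try:
        return BINLOG_SERVERS[shard]
    except KeyError:
        raise ValueError(
            f"no binlog server configured for shard {shard!r}"
        ) from None


host, port = binlog_server_for("-80")
print(f"{host}:{port}")  # binlog-a.example:3306
```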

### Binlog server and its state management

As of now, Vitess does not provide a binlog server out of the box. You will have to run the binlog server yourself and connect it to the master tablet's database. Because the master tablet can change through reparenting or other means, you have to repoint the binlog server to the new master when that happens. The binlog server also needs to be highly available, because the binlog files are critical for the restore. If you have a sharded database, you will need one binlog server for each shard's master.


