Vine: Track Last Failure Time

In certain pathological circumstances, Vine rapidly retries the same file transfer many times, even though it fails each time.  We see this situation when running RSTriPhoton:

- Worker A has the only available copy of file F.
- Manager tells Worker B to fetch F from A and run task T1.
- Worker A fails in some way, but the manager doesn't see it yet.
- Worker B reports back that the transfer failed, and task T1 fails with `INPUT_MISSING`.
- Manager assigns another task to Worker B and tells it to fetch from A.
- This process repeats for some time until the manager finally runs out of things to do and notices that Worker A has failed.

So, the solution is this:
- Each worker should have a `last_transfer_failure` field.
- Upon receiving cache-invalid, the manager should set `last_transfer_failure` of the source worker to the current time.
- When selecting replicas, the manager should skip replicas on workers that had a transfer failure within the last N seconds (N should be configurable).

The effect of this should be to "slow down" when failures are encountered, so that the manager will go do other things instead and "catch up" on failure information.
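The three steps above can be sketched roughly as follows. This is a minimal Python model for illustration only, not the actual manager code: the `Worker` and `Manager` classes, the `transfer_failure_timeout` parameter (standing in for N), and the `handle_cache_invalid` and `select_replica` method names are all hypothetical. Only the `last_transfer_failure` field comes from the proposal itself.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Worker:
    name: str
    # Time of the most recent failed transfer sourced from this worker,
    # or None if no failure has been recorded.
    last_transfer_failure: Optional[float] = None


@dataclass
class Manager:
    # Hypothetical tunable standing in for "N seconds" in the proposal.
    transfer_failure_timeout: float = 15.0
    # Injectable clock so the back-off logic can be tested without sleeping.
    clock: Callable[[], float] = time.monotonic
    # Assumed bookkeeping: file name -> workers believed to hold a replica.
    replicas: Dict[str, List[Worker]] = field(default_factory=dict)

    def handle_cache_invalid(self, source: Worker) -> None:
        """Record that a transfer from `source` just failed."""
        source.last_transfer_failure = self.clock()

    def recently_failed(self, worker: Worker) -> bool:
        """True if `worker` had a transfer failure within the timeout window."""
        if worker.last_transfer_failure is None:
            return False
        elapsed = self.clock() - worker.last_transfer_failure
        return elapsed < self.transfer_failure_timeout

    def select_replica(self, filename: str) -> Optional[Worker]:
        """Return the first replica holder that has not failed recently."""
        for worker in self.replicas.get(filename, []):
            if not self.recently_failed(worker):
                return worker
        # No usable source right now; the caller can go do other work.
        return None
```

In the scenario above, once Worker B reports the failed transfer, `select_replica("F")` would return `None` until the window expires, so the manager would stop dispatching tasks that need F from A and have time to notice that Worker A is gone. What the manager does with a `None` result (deferring the task, for example) is left open here.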


