In certain pathological circumstances, Vine will attempt to retry the same transfer of a file multiple times rapidly, even though it fails. We see this situation when running RSTriPhoton:
- Worker A has the only available copy of file F.
- Manager tells Worker B to fetch F from A and run task T1.
- Worker A fails in some way, but the manager doesn't see it yet.
- Worker B reports back that the transfer failed, and task T1 fails with INPUT_MISSING.
- Manager assigns another task to worker B and tells it to fetch from A.
- This process repeats for some time until the manager finally runs out of things to do and notices that worker A has failed.
So, the solution is this:
- Each worker should have a last_transfer_failure field.
- Upon receiving cache-invalid, the manager should set last_transfer_failure of the source worker to the current time.
- When selecting replicas, the manager should skip over replicas on workers that had a transfer failure within the last N seconds, where N is configurable.
The effect of this should be to "slow down" when failures are encountered, so that the manager will go do other things instead and "catch up" on failure information.
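To make the proposal concrete, here is a minimal sketch of the three steps in Python. It is not the actual manager code: the `Worker` and `Manager` classes, the `on_cache_invalid` and `select_replica` methods, the `transfer_failure_timeout` parameter (the "N seconds" above), and its default of 15 seconds are all hypothetical names and values chosen for illustration.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class Worker:
    name: str
    # Time of the most recent failed transfer sourced from this worker,
    # or None if no failure has been reported. (Field name from the proposal.)
    last_transfer_failure: Optional[float] = None


class Manager:
    """Hypothetical stand-in for the manager's replica bookkeeping."""

    def __init__(self, transfer_failure_timeout=15.0, clock=time.monotonic):
        # "N seconds" from the proposal; the default here is an assumption.
        self.transfer_failure_timeout = transfer_failure_timeout
        self.clock = clock
        self.replicas = {}  # file name -> list of Workers holding a copy

    def on_cache_invalid(self, source):
        # Step 2: record when a transfer from this source worker failed.
        source.last_transfer_failure = self.clock()

    def select_replica(self, filename):
        # Step 3: skip replicas on workers that failed within the timeout.
        now = self.clock()
        for worker in self.replicas.get(filename, []):
            failed_at = worker.last_transfer_failure
            if failed_at is not None and now - failed_at < self.transfer_failure_timeout:
                continue
            return worker
        # No usable source right now; the caller should go do other work.
        return None
```

With this shape, a task that needs F simply finds no eligible source while worker A is inside the timeout window, so the manager moves on instead of dispatching another doomed transfer to worker B.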