
Vine: Track Last Failure Time #3706

@dthain


In certain pathological circumstances, Vine will rapidly retry the same file transfer many times, even though it fails every time. We see this situation when running RSTriPhoton:

  • Worker A has the only available copy of file F.
  • Manager tells Worker B to fetch F from A and run task T1.
  • Worker A fails in some way, but the manager doesn't see it yet.
  • Worker B reports back that the transfer failed, and task T1 fails with INPUT_MISSING.
  • Manager assigns another task to Worker B and again tells it to fetch F from A.
  • This process repeats for some time until the manager finally runs out of things to do and notices that Worker A has failed.

So, the solution is this:

  • Each worker should have a last_transfer_failure field.
  • Upon receiving cache-invalid, the manager should set last_transfer_failure of the source worker to the current time.
  • When selecting replicas, the manager should skip over replicas on workers that had a transfer failure within the last N seconds, where N is configurable.

The effect of this should be to "slow down" when failures are encountered, so that the manager will go do other things instead and "catch up" on failure information.
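
As a rough illustration of the three steps above, here is a minimal Python sketch of the bookkeeping. It is not the TaskVine implementation (which is written in C); the names `Worker`, `Manager`, `handle_cache_invalid`, `select_replica`, and `transfer_failure_timeout`, as well as the 15 second default, are hypothetical and chosen only for this example. Only `last_transfer_failure` comes from the proposal itself.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Worker:
    name: str
    # Time of the most recent failed transfer *from* this worker.
    # None means no failure has been recorded.
    last_transfer_failure: Optional[float] = None


@dataclass
class Manager:
    # "N seconds" from the proposal; the default value is an assumption.
    transfer_failure_timeout: float = 15.0
    # Hypothetical replica table: file name -> workers holding a copy.
    replicas: Dict[str, List[Worker]] = field(default_factory=dict)

    def handle_cache_invalid(self, source: Worker, now: Optional[float] = None) -> None:
        """Record that a transfer from `source` just failed."""
        source.last_transfer_failure = time.time() if now is None else now

    def select_replica(self, filename: str, now: Optional[float] = None) -> Optional[Worker]:
        """Return a source worker for `filename`, skipping recent failures."""
        now = time.time() if now is None else now
        for worker in self.replicas.get(filename, []):
            failed_at = worker.last_transfer_failure
            if failed_at is not None and now - failed_at < self.transfer_failure_timeout:
                continue  # failed too recently; let failure information catch up
            return worker
        return None  # no usable replica right now; the manager does other work
```

When `select_replica` returns `None`, the manager would simply not dispatch a task that depends on that file for the moment, which is the "slow down" behavior described above. Passing `now` explicitly is only a convenience for testing the sketch.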

Labels: TaskVine, bug, critical
