Support WDL 1.2 "Extended" File/Directory format#834

Draft
adamnovak wants to merge 2 commits into chanzuckerberg:main from adamnovak:extended-format

@adamnovak
Contributor

adamnovak commented Feb 3, 2026

Motivation

This should fix #833 by implementing the "extended" File and Directory format from WDL 1.2.

Approach

This changes File and Directory values to hold the extended-format dicts, with the actual path read from or written to the ["location"] entry when needed.
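
A minimal sketch of that idea, not the code in this PR: the class name `ExtendedFile` and its methods are hypothetical, and only the `location` and `basename` keys mentioned in this description are assumed.

```python
from typing import Any, Dict


class ExtendedFile:
    """Hypothetical stand-in for a File value; not miniwdl's actual class."""

    def __init__(self, value: Dict[str, Any]) -> None:
        # Assumption for this sketch: every File dict carries a "location".
        if "location" not in value:
            raise ValueError("this sketch requires a 'location' entry")
        self.value = dict(value)  # shallow copy so callers cannot mutate our state

    @property
    def location(self) -> str:
        # The actual path or URL is fetched out of the dict when needed.
        return self.value["location"]

    def with_location(self, new_location: str) -> "ExtendedFile":
        # Replace only the location; all other metadata is carried along.
        updated = dict(self.value)
        updated["location"] = new_location
        return ExtendedFile(updated)


original = ExtendedFile({"location": "/data/reads.bam", "basename": "sample.bam"})
moved = original.with_location("/work/inputs/reads.bam")
```

Returning a new object from `with_location` rather than mutating in place is a design choice of the sketch; it keeps the original value usable as a cache key.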

This is still a draft; it still needs:

  • Logic to recursively adjust locations inside Directory listings when adjusting the parent location. (Maybe we should only keep the top-level location set?)
  • Logic to fill in listings when loading a Directory from a path or URL.
  • A way to get the extended format as workflow/task output for the user (though we could get away with not having this)
  • The ability to key the cache on the listing contents (I touched the caching slightly but I don't know if it's really hooked in)
  • Possibly checksum computation; the spec suggests you ought to have it but doesn't define how.
  • The ability to reject the extended format in runs of WDL 1.1 and earlier (might be able to skip this?).
  • Support for actually presenting the listing as described in the input JSON object, and not whatever's currently on disk at the location, to user code.
  • Support for input Directory values without a location, using the listing instead to construct them.
  • Support for localizing files at the extended-format basename when it differs from the basename of the location.
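
The first item above (recursively adjusting locations inside a listing when the parent moves) could look roughly like the following. This is a sketch under assumptions: `relocate` is a hypothetical helper, entries are plain dicts with optional `location` and `listing` keys, and a child is rewritten only if its location sits under the old parent prefix.

```python
from typing import Any, Dict


def relocate(entry: Dict[str, Any], old_root: str, new_root: str) -> Dict[str, Any]:
    """Return a copy of entry with locations under old_root moved under new_root."""
    old = old_root.rstrip("/")
    new = new_root.rstrip("/")
    result = dict(entry)
    location = result.get("location")
    # Rewrite the entry itself, or anything strictly below the old root.
    if location is not None and (location == old_root or location.startswith(old + "/")):
        result["location"] = new + location[len(old):]
    # Recurse into Directory listings; entries elsewhere (e.g. URLs) are untouched.
    if "listing" in result:
        result["listing"] = [relocate(child, old_root, new_root) for child in result["listing"]]
    return result


directory = {
    "location": "/in/ref",
    "listing": [
        {"location": "/in/ref/genome.fa", "basename": "genome.fa"},
        {"location": "https://example.org/genome.fa.fai", "basename": "genome.fa.fai"},
    ],
}
moved_dir = relocate(directory, "/in/ref", "/work/ref")
```

Keeping only the top-level location set, as the parenthetical suggests, would make this pass unnecessary at the cost of resolving child paths at use time.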

Checklist

  • Add appropriate test(s) to the automatic suite
  • Use make pretty to reformat the code with ruff format
  • Use make check to statically check the code using ruff check and mypy
  • Send PR from a dedicated branch without unrelated edits
  • Ensure compatibility with this project's MIT license

@adamnovak
Contributor Author

@mlin Does this seem like the right general approach to you?

@mlin
Collaborator

mlin commented Feb 4, 2026

@adamnovak I have to say I'm initially a little bearish on changing the internal representation of Value.{File,Directory} to this extent at this point in miniwdl's lifecycle. But maybe I just need some time to grow confidence in it.

Wondering, is your main interest in the cache coherence aspect, or in the place to stick arbitrary metadata? Could we get pretty far by preprocessing the extended input JSON and keeping some crappy global dict of path to metadata?
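
For illustration, the preprocessing idea might be sketched as follows. `PATH_METADATA` and `flatten_inputs` are hypothetical names, and the rule "any dict with a `location` key is an extended File/Directory" is a simplifying assumption that would misfire on a Map or struct input that happens to use that key.

```python
from typing import Any, Dict

# Hypothetical module-level side table: location -> remaining extended-format fields.
PATH_METADATA: Dict[str, Dict[str, Any]] = {}


def flatten_inputs(value: Any) -> Any:
    """Replace extended-format dicts with their plain location string."""
    if isinstance(value, dict) and "location" in value:
        location = value["location"]
        # Stash everything except the location for later lookup by path.
        PATH_METADATA[location] = {k: v for k, v in value.items() if k != "location"}
        return location
    if isinstance(value, dict):
        return {k: flatten_inputs(v) for k, v in value.items()}
    if isinstance(value, list):
        return [flatten_inputs(v) for v in value]
    return value


inputs = {
    "wf.reads": {"location": "/data/a.bam", "basename": "sample.bam"},
    "wf.threads": 4,
}
flat_inputs = flatten_inputs(inputs)
```

After this pass the rest of the runner would see ordinary string paths, with the extra fields recoverable from the side table.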

@adamnovak
Contributor Author

adamnovak commented Feb 4, 2026

I think my main interest is in allowing a Directory to take over representing the directory tree that it denotes, and to take that responsibility away from the filesystem. The new input format in the spec lets you say something like "The input is a directory with this file as path file1 and this web URL as path file2", even if that directory structure doesn't exist anywhere. miniwdl would either have to build that locally, or else let Directory know enough to build that on the fly.
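
As a rough illustration of such a "virtual" directory, here is a sketch that computes the relative-path-to-source mapping on the fly. The dict uses only the `basename`, `location`, and `listing` keys discussed in this thread; the concrete paths, the URL, and the `walk` helper are made up for the example, and an entry is treated as a directory exactly when it has a `listing`.

```python
from typing import Any, Dict, Iterator, Tuple


def walk(entry: Dict[str, Any], prefix: str = "") -> Iterator[Tuple[str, str]]:
    """Yield (path relative to the directory root, source location) for each file."""
    for child in entry.get("listing", []):
        path = prefix + child["basename"]
        if "listing" in child:
            # Nested directory: descend, extending the relative path.
            yield from walk(child, path + "/")
        else:
            yield path, child["location"]


# A directory that exists nowhere on disk: one local file and one web URL.
virtual_dir = {
    "basename": "inputs",
    "listing": [
        {"basename": "file1", "location": "/home/user/data/a.txt"},
        {"basename": "file2", "location": "https://example.org/b.txt"},
    ],
}
layout = dict(walk(virtual_dir))
```

Here `layout` maps `file1` and `file2` to their respective sources, which is the information a runner would need to build the tree locally or mount it per task.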

In Toil, we have to deal with building these trees on the fly all the time because Toil's backing storage abstraction stores only files, not directories. So we encode whole trees of what files go where into strings and use those as WDL Directory string values.

If miniwdl had a Directory abstraction that knew it was responsible for information about what files go where, then Toil could work with that and throw away a lot of hacks.

Maybe the right approach is to leave File alone as basically a String, but to make Directory (which isn't standardized anyway until the 1.2 WDL spec that also adds the extended format) a hierarchical data structure.

@mlin
Collaborator

mlin commented Feb 7, 2026

@adamnovak If our goal were only to support the "extended" file/directory input format then I feel we could do it easily by preprocessing the input JSON, materializing the desired POSIX structure (using symlinks, hardlinks, or last-resort copies where needed), and starting the workflow on that. But I think I'm hearing from you that Toil needs finer control of each task's filesystem, is that right?
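
A sketch of what that materialization step could look like for local sources only. `place` and `materialize` are hypothetical helpers, the symlink-then-hardlink-then-copy order is one possible choice, and downloading URL locations is deliberately left out.

```python
import os
import shutil
import tempfile
from typing import Any, Dict


def place(src: str, dest: str) -> None:
    """Put src at dest: symlink first, then hardlink, then a last-resort copy."""
    for attempt in (os.symlink, os.link):
        try:
            attempt(src, dest)
            return
        except OSError:
            continue
    shutil.copy2(src, dest)


def materialize(entry: Dict[str, Any], parent: str) -> str:
    """Recreate an extended-format entry under parent and return its path."""
    # Assumption: an entry without "basename" has a "location" to take the name from.
    name = entry.get("basename") or os.path.basename(entry["location"].rstrip("/"))
    dest = os.path.join(parent, name)
    if "listing" in entry:
        # Treated as a directory: build it, then fill in each child.
        os.makedirs(dest, exist_ok=True)
        for child in entry["listing"]:
            materialize(child, dest)
    else:
        place(os.path.abspath(entry["location"]), dest)
    return dest


with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "a.txt")
    with open(source, "w") as handle:
        handle.write("hello")
    spec = {"basename": "inputs", "listing": [{"basename": "file1", "location": source}]}
    root = materialize(spec, os.path.join(tmp, "work"))
    with open(os.path.join(root, "file1")) as handle:
        content = handle.read()
```

Because the destination name comes from `basename`, this also places a file under a name that differs from the basename of its location.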

miniwdl's POSIX filesystem assumptions are pretty intentional in keeping it "mini", so I think that's where we're diverging a bit. I do think it should have a pluggable architecture to accommodate more exotic needs. The CallCache is pluggable (but the semantics need more documentation). The filesystem interactions are not so much; perhaps that's the direction to head?


Development

Successfully merging this pull request may close these issues.

Support WDL 1.2 "extended" syntax for File and Directory values
