Skip to content

Workflows-as-programs support#55

Merged
tsibley merged 2 commits intomainfrom
trs/workflows-as-programs
Mar 28, 2025
Merged

Workflows-as-programs support#55
tsibley merged 2 commits intomainfrom
trs/workflows-as-programs

Conversation

@tsibley
Copy link
Copy Markdown
Contributor

@tsibley tsibley commented Oct 30, 2024

See commit messages.

Checklist

  • Checks pass

@tsibley tsibley changed the base branch from main to trs/remove-hardcoded-references October 30, 2024 20:23
@huddlej huddlej mentioned this pull request Nov 5, 2024
36 tasks
Base automatically changed from trs/remove-hardcoded-references to main November 6, 2024 22:17
Comment on lines +9 to +14
# Use default configuration values. Extend with Snakemake's --configfile/--config options.
configfile: os.path.join(workflow.basedir, "defaults/config.yaml")

# Use custom configuration from analysis directory (i.e. working dir), if any.
if os.path.exists("config.yaml"):
configfile: "config.yaml"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The behaviour of multiple configfiles is somewhat documented but I think will prove a tripping hazard to people.

Multiple configfiles (as per this section of code, and/or via additional --configfiles) will overwrite scalars and lists; dicts will be merged, but nested dicts won't be; there's no way to remove a dictionary key as far as I can find.

This has implications for how we structure configs, specifically around how we encode flags to skip rules or skip a particular argument.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nod, for sure, but "multiple config files" is already our established approach, so this is making that more accessible/standard but isn't changing approaches.

We could change approaches, though, by combining the configs ourselves outside of Snakemake's machinery. I didn't think there was appetite for that, but it's very possible.

dicts will be merged, but nested dicts won't be; there's no way to remove a dictionary key as far as I can find.

Hmm. In my recollection (and testing right now), nested dicts are merged? That is, the config merge is deep.

The merge is additive though, so yes, there is no way to remove keys.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nested dicts are merged

Oh yeah, I wasn't formatting my YAML correctly. That they are merged will be very helpful for us.

Copy link
Copy Markdown
Member

@jameshadfield jameshadfield Nov 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"multiple config files" is already our established approach, so this is making that more accessible/standard but isn't changing approaches.

I can see the point of view that this isn't a workflows-as-programs thing rather a pathgon-repo-{guide,template} thing. Similar to my comment in a recent meeting about merging in additional data I think these concepts should be thought of as intertwined rather than isolated in order to make workflows-as-programs as successful as possible.

Our current use of multiple config files are largely just adding in specific steps, e.g I think the examples in measles are a good summary of how we use them more generally:

  • Measles automated builds use an additional config to add in a deploy rule and the associated config params
  • Measles CI uses an additional config to add a rule which copies example data and thus avoids downloading the entire dataset.

I think it's rarer¹ to use additional configs to change parameters of builds, or to subset build targets etc. If we want to recommend this approach -- an approach that I think we should use and workflows-as-programs seems to lead us towards anyway -- then it's not going to play nicely with our current config structures. Some examples:

  • Avian-flu uses a~nested-dict to define the builds to run (subtype x time window). So we can't override this with a config file without having some approach to encoding "skip this" / "no-op". (Mentioned here as well.) Edit: actually in its current formation it's a dict of lists so there is a way to overwrite, albeit a little clunky.
  • Seasonal flu (at least, the public builds) uses an array of segments and a dictionary of builds (builds here is h1n1pdm_2y, h3n2_2y etc). So config overlays aren't going to be able to target a subset of those builds.
  • RSV uses two lists to encode the builds & resolutions which will play nicely with config overrides but is predicated on all builds being run for all resolutions which isn't always the case.
  • (Somewhat unrelated, but...) Dengue hardcodes serotypes/genes so that'll need lifting into config to make the most of workflows-as-programs.

cc @joverlee521

¹ The most complex workflows are seasonal-flu and ncov, so maybe those utilise the style of config merging I'm talking about here. I don't think seasonal-flu does at all, but maybe? It'd be great to have examples in Nextstrain where we do make use of config merging to replace / subset config parameters.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"multiple config files" is already our established approach, so this is making that more accessible/standard but isn't changing approaches.

...

I think it's rarer¹ to use additional configs to change parameters of builds, or to subset build targets etc. If we want to recommend this approach -- an approach that I think we should use and workflows-as-programs seems to lead us towards anyway -- then it's not going to play nicely with our current config structures. Some examples:

(avain-flu, seasonal-flu, RSV examples elided)

I think it's totally fair to draw a line, somewhere (not exactly sure where), and say the pathogen repos on one side of it are "legacy" and will continue to be maintained the way they have been historically, and will not use these new patterns, while pathogen repos on the other side of the line are "current" and will be keep reasonably up-to-date and consistent with our best practices. (We can also decide, down the road, to invest the effort into converting "legacy" repos into "current" repos, if and when that feels like it has a positive return-on-effort.)

Note: not currently taking a strong stance on where that line is 😁

* (Somewhat unrelated, but...) Dengue [hardcodes](https://github.com/nextstrain/dengue/blob/6f9144f7767c0cac4755c60cd5c069a5d2c4ed6f/phylogenetic/Snakefile#L3-L4) serotypes/genes so that'll need lifting into config to make the most of workflows-as-programs.

Note that measles and yellow fever do the same thing.

Copy link
Copy Markdown
Member

@jameshadfield jameshadfield Nov 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An example of this is the WNV/washington-state config, which layers on top of the default config.

Thanks for the example! That's a great example of a simple config structure (i.e. a single dataset target, no wildcards in the workflow) where merging works nicely. It will be interesting to see how that develops as WNV expands (e.g. to have NA & global targets) - does it go the mpox route of one config (or one config overlay) per target, or does it tend towards a more complicated config structure. It's specifically the more complicated structures I was focusing on above.

(It also uses an approach I think we've moved away from, namely

refine:
  treetime_params: --coalescent opt --date-inference marginal ...

which makes it straightforward to remove/add options/params as you're in charge of the entire string. As a thought exercise, if the base config instead specified

refine:
  coalescent: opt

and the overlay config wanted to skip that argument entirely how would it do it?)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a thought exercise, if the base config instead specified

refine:
  coalescent: opt

and the overlay config wanted to skip that argument entirely how would it do it?)

I can't for the life of me find where I wrote about this recentlyish, but I'd suggested previously:

refine:
  coalescent: ~   # or: null

That's a great example of a simple config structure (i.e. a single dataset target, no wildcards in the workflow) where merging works nicely. It will be interesting to see how that develops as WNV expands (e.g. to have NA & global targets) - does it go the mpox route of one config (or one config overlay) per target, or does it tend towards a more complicated config structure.

The niceness of simple config structures is why, in Slack recently, I suggested we consider "small multiples" of config (optionally with tooling to allow derived configs/inheritance) rather than a large complex cross-product config.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't for the life of me find…

oh, it was a freakin' Google Docs comment.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dicts will be merged, but nested dicts won't be

P.S. This behaviour changed in Snakemake 6.11.0 (Nov 2021). Before that nested dicts were not merged. (Yep, it was another potentially-breaking change in a minor version release.)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but I'd suggested previously:

Yeah, saw that. I think something like that's the right move, and the corresponding snakemake logic needs to ensure that it drops the --coalescent argument name as well as the null value. (We can probably use a helpful helper function here.)

I prototyped a similar approach for nextstrain/avian-flu#104 but in the context of defining what builds to run (for gisaid there are 48 by default, so it's nice to (a) not enumerate them all and (b) be able to subset them for a particular analysis). One of the prototypes looked like this:

builds:
  h5nx:
    all-time:
      - ha
      - na
    2y: 
  h5n1: 
  h7n9: 
  h9n2: 

But I didn't like the necessity of listing out all the builds you don't want to run. I switched to a list-based interface which is much nicer in this respect. (There were other reasons for switching too, namely the hierarchy here isn't right and to make it right results in very verbose dictionaries.)

@jameshadfield
Copy link
Copy Markdown
Member

I'd like to see config parsing such as this become part of our general snakemake workflows regardless of workflows as programs CLI work.

@tsibley
Copy link
Copy Markdown
Contributor Author

tsibley commented Nov 12, 2024

I'd like to see config parsing such as this become part of our general snakemake workflows regardless of workflows as programs CLI work.

Agreed! (ever since I started 🙃)

Copy link
Copy Markdown
Contributor

@joverlee521 joverlee521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main concern is the traceability of config files with this change. I think I need to wrap my head around the fact that the same exact config file can result in different input files depending on which working directory was used.

Comment on lines +3 to +12
exclude: "dropped_strains.txt"
include_genome: "include_strains_genome.txt"
include_N450: "include_strains_N450.txt"
reference: "measles_reference.gb"
reference_N450: "measles_reference_N450.gb"
reference_N450_fasta: "measles_reference_N450.fasta"
colors: "colors.tsv"
auspice_config: "auspice_config.json"
auspice_config_N450: "auspice_config_N450.json"
description: "description.md"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slightly concerned that this change may confuse external users who want to trace the default files. It's not immediately clear that these are pulling from "defaults/".

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the ingest/defaults/config.yaml comments on files' locations, but this phylogenetic/defaults/config.yaml did not and so I also did not bother. But we very easily could if we think it'll be a tripping hazard.

If we're terribly concerned and don't think a comment is enough, we can also leave the defaults/ prefix in the default config. It's handled for backwards-compat.

# Special-case defaults/… for backwards compatibility with older
# configs. We could achieve the same behaviour with a symlink
# (defaults/defaults → .) but that seems less clear.
if path.startswith("defaults/"):
defaults_path = os.path.join(workflow.basedir, path)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the additional comments should be enough.

we can also leave the defaults/ prefix in the default config.

I thought the defaults/ prefix was removed so that the config will automatically use input files in the analysis directory? If we keep the defaults/ prefix, users must override the config param or put their files in a defaults/ directory within their analysis directory right?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the defaults/ prefix was removed so that the config will automatically use input files in the analysis directory?

That's my understanding as well.

The search order is:

  1. Filename in current analysis directory (if it exists)
  2. Filename in workflow's defaults/ directory. (If the config-specified filename has "defaults/" then that's ignored for this step.)

By not using defaults/ in the config-specified filename it makes it possible to swap out the workflow file (TSV etc) file with one of the same name in the current analysis directory.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the defaults/ prefix was removed so that the config will automatically use input files in the analysis directory?

Sorry, yes, you're both correct with the code in this PR as-is and what I said. The bit I forgot to say as I was rushing out the door (ahh! 🤦) was that we could strip the defaults/ prefix off when searching for files in the analysis directory, thus letting us keep the defaults/ prefix in the default config. Given we're already special-casing it for backwards-compat, we could instead special-case it the other way for analysis directory lookup instead.

(This is actually the first pattern I tinkered with over a month ago, but I thought it was more confusing than a simple unchanging path + search order.)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There may be more than one, of course.

Hmm? There's only one config object at run time, and that's what we'd write out.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I run multiple invocations of the workflow with different overlays, or invocations of different workflows, each will write sets of files into the analysis directory, and these sets may not overlap or may partially overlap. To use an avian-flu example, I may have an analysis directory with both gisaid & cattle-outbreak datasets.

Copy link
Copy Markdown
Contributor Author

@tsibley tsibley Dec 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. Multiple invocations (whether same workflow or not) writing to the same working analysis directory seemed to me like something we didn't need to support (in terms of keeping things separate): there's always the chance of overwriting files in that case (often overwriting is desired/expected) and confusion around what's output from which invocation. Do you think we definitely do need to support that sort of use case? If so, the implications of that would seem to extend far beyond the proposed results/config.yaml discussed here.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we definitely do need to support that sort of use case?

No - but it's something to keep in mind as the structure is built out, because what we're building out does support it and I think it'll be a common use case. For instance, if we could include some identifier based on the invocation within the filename we wouldn't be simply overwriting results/config.yaml each time. This follows on from my comment above, I think we should also write out information beyond the config, e.g. pathogen version used, runtime version, invocation command etc. You could shoehorn that into the written-out config, or create a directory for each workflow invocation with files detailing all that information; I'd think logs & benchmarks may better live here as well. Obviously we don't have to do all this now, but thinking about the structure now seems like the right move.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For instance, if we could include some identifier based on the invocation within the filename we wouldn't be simply overwriting results/config.yaml each time.

Right, of course, but we're still overwriting all/most of the other files in results/ each time though. I don't see the harm in matching that behaviour with results/config.yaml for now.

I think we should also write out information beyond the config, e.g. pathogen version used, runtime version, invocation command etc. You could shoehorn that into the written-out config, or create a directory for each workflow invocation with files detailing all that information; I'd think logs & benchmarks may better live here as well. Obviously we don't have to do all this now, but thinking about the structure now seems like the right move.

A timestamped directory to contain all this individual-run information makes the most sense to me and is the most conventional… but I don't think we need to worry about this structure right now now. I think it's too early to implement now; we should focus on the core workflows-as-programs changes first before building on them while they're still forming. I also think there's some consideration to take on what outputs get differentiated between invocations (e.g. protected from overwrite) and what outputs don't, and why that split is useful vs. the default situation of "nothing's protected" or "everything's protected".

Comment on lines +40 to +42
# The path to the local geolocation rules for this pathogen.
# The path should be relative to the working directory (e.g. --directory).
# If the path doesn't exist in the working directory, the file in the workflow's defaults/ directory it used instead (if it exists).
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slightly concerned that users will be confused by the term "working directory", especially those who are not familiar with Snakemake. (I had originally made everything relative to the ingest directory so that users had a concrete reference point.)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For user reference, we can say "your analysis directory", perhaps with an explanatory parenthetical like "(Snakemake's working directory, --directory, often your current directory)". This seems like a tripping hazard we can address various ways.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a tripping hazard we can address various ways.

Totally! Just want to make sure we address it as paths are often the most confusing thing...

Seems like the working directory can be different things depending on how you run the workflow.

  1. native snakemake = the current directory or provided --directory.
  2. nextstrain build <directory> = the provided <directory> (can be complicated by additional Snakemake --directory option)
  3. nextstrain run <pathogen-name> <workflow-name> <analysis-directory> = provided <analysis-directory>

I do like that (3) is the most straight-forward so all for moving in that direction.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following this up after work in avian-flu and discussions in slack, I think we'll need words to describe the following 4 locations (even if some are only for developers). This is assuming we go with the slight variant to (3) which allows things like nextstrain run avian-flu phylogenetic/h5n1-cattle-outbreak my-analysis.

  1. "pathogen-name" - maybe obvious, but this may include version identifiers etc
  2. the workflow name - "ingest", "phylogenetic", "phylogenetic/gisaid", "phylogenetic/clade-i" etc. workflow.basedir in snakemake.
  3. the analysis directory. os.getcwd() in snakemake.
  4. The "base" workflow directory if there are (sub-)workflows which use a common snakefile. In avian-flu this'd be avian-flu/Snakefile (or maybe one day avian-flu/phylogenetic/Snakefile). This should only be relevant for workflow developers.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. The pathogen spec I'm working from in Nextstrain CLI is <name>[@<version>[=<url>]]. A URL is only accepted by nextstrain setup, as a way to override the default assumption that the name refers to a Nextstrain pathogen repo and/or that the version refers to a Git rev in the same.

  2. I want to stress that I think it's important to conceptually treat/talk about these mostly as names (like you've done!) and not paths, even if they're used internally as paths initially (and likely indefinitely). Doing so makes it more sensible to do things like alias (or "redirect") old names to new names (e.g. via entries in nextstrain-pathogen.yaml, or symlinks) to ease use or preserve backwards compatibility.

  3. I'd call this the "working analysis directory" in full, shortening to "analysis directory" (for Nextstrain CLI) or "working directory" (for Snakemake development).

  4. I think avian-flu/Snakefile (or avian-flu/phylogenetic/Snakefile) would ideally be avian-flu/phylogenetic/rules/base.smk (or core.smk or something similar). That is, we'd stick to the convention of only using the Snakefile filename for supported workflow entrypoints. I think this will reduce confusion.

@tsibley
Copy link
Copy Markdown
Contributor Author

tsibley commented Nov 25, 2024

I think I need to wrap my head around the fact that the same exact config file can result in different input files depending on which working directory was used.

Nod. One upside here is that we can say things like "send us a copy of your analysis directory (working directory)" instead of "send us your config, plus any additional files you configured".

@genehack
Copy link
Copy Markdown
Contributor

I've reviewed the code changes; I think having the ability to have multiple (semi-)concurrent outputs based on config variations is a worthwhile addition.

I don't think I have strong opinions about the implementation specifics, probably because I don't have sufficient experience with using Snakemake in properly complicated workflows.

@tsibley tsibley force-pushed the trs/workflows-as-programs branch from 91d8f0a to f0745a2 Compare February 13, 2025 21:31
tsibley added a commit to nextstrain/cli that referenced this pull request Mar 6, 2025
Adds a new command, `nextstrain run`, to run (compatible) pathogen
workflows in a more managed way with easier update paths, without the
need for user-facing Git, with support for multiple versions, and with
support for concurrent-but-separate analyses via the same workflow.

Supported by changes to

  - `nextstrain setup` to obtain and set up specific versions of pathogens
  - `nextstrain update` to keep pathogens up-to-date
  - `nextstrain version` to report on pathogen versions available locally

At the moment, the only compatible pathogen is measles at my
not-yet-finished demo/prototype branch.¹  Avian flu should not be far
behind, though.

There's a lot of functionality (and polish) here and elsewhere still
todo to fully realize the sweeping goals of workflows-as-programs², but
this is a fully-usable first piece of the puzzle that can stand on its
own for now.

¹ <nextstrain/measles#55>
² <nextstrain/public#1>
tsibley added a commit to nextstrain/cli that referenced this pull request Mar 6, 2025
Adds a new command, `nextstrain run`, to run (compatible) pathogen
workflows in a more managed way with easier update paths, without the
need for user-facing Git, with support for multiple versions, and with
support for concurrent-but-separate analyses via the same workflow.

Supported by changes to

  - `nextstrain setup` to obtain and set up specific versions of pathogens
  - `nextstrain update` to keep pathogens up-to-date
  - `nextstrain version` to report on pathogen versions available locally

At the moment, the only compatible pathogen is measles at my
not-yet-finished demo/prototype branch.¹  Avian flu should not be far
behind, though.

There's a lot of functionality (and polish) here and elsewhere still
todo to fully realize the sweeping goals of workflows-as-programs², but
this is a fully-usable first piece of the puzzle that can stand on its
own for now.

¹ <nextstrain/measles#55>
² <nextstrain/public#1>
tsibley added a commit to nextstrain/cli that referenced this pull request Mar 10, 2025
Adds a new command, `nextstrain run`, to run (compatible) pathogen
workflows in a more managed way with easier update paths, without the
need for user-facing Git, with support for multiple versions, and with
support for concurrent-but-separate analyses via the same workflow.

Supported by changes to

  - `nextstrain setup` to obtain and set up specific versions of pathogens
  - `nextstrain update` to keep pathogens up-to-date
  - `nextstrain version` to report on pathogen versions available locally

At the moment, the only compatible pathogen is measles at my
not-yet-finished demo/prototype branch.¹  Avian flu should not be far
behind, though.

There's a lot of functionality (and polish) here and elsewhere still
todo to fully realize the sweeping goals of workflows-as-programs², but
this is a fully-usable first piece of the puzzle that can stand on its
own for now.

¹ <nextstrain/measles#55>
² <nextstrain/public#1>
tsibley added a commit to nextstrain/cli that referenced this pull request Mar 10, 2025
Adds a new command, `nextstrain run`, to run (compatible) pathogen
workflows in a more managed way with easier update paths, without the
need for user-facing Git, with support for multiple versions, and with
support for concurrent-but-separate analyses via the same workflow.

Supported by changes to

  - `nextstrain setup` to obtain and set up specific versions of pathogens
  - `nextstrain update` to keep pathogens up-to-date
  - `nextstrain version` to report on pathogen versions available locally

At the moment, the only compatible pathogen is measles at my
not-yet-finished demo/prototype branch.¹  Avian flu should not be far
behind, though.

There's a lot of functionality (and polish) here and elsewhere still
todo to fully realize the sweeping goals of workflows-as-programs², but
this is a fully-usable first piece of the puzzle that can stand on its
own for now.

¹ <nextstrain/measles#55>
² <nextstrain/public#1>
tsibley added a commit to nextstrain/cli that referenced this pull request Mar 10, 2025
Adds a new command, `nextstrain run`, to run (compatible) pathogen
workflows in a more managed way with easier update paths, without the
need for user-facing Git, with support for multiple versions, and with
support for concurrent-but-separate analyses via the same workflow.

Supported by changes to

  - `nextstrain setup` to obtain and set up specific versions of pathogens
  - `nextstrain update` to keep pathogens up-to-date
  - `nextstrain version` to report on pathogen versions available locally

At the moment, the only compatible pathogen is measles at my
not-yet-finished demo/prototype branch.¹  Avian flu should not be far
behind, though.

There's a lot of functionality (and polish) here and elsewhere still
todo to fully realize the sweeping goals of workflows-as-programs², but
this is a fully-usable first piece of the puzzle that can stand on its
own for now.

¹ <nextstrain/measles#55>
² <nextstrain/public#1>
tsibley added a commit to nextstrain/cli that referenced this pull request Mar 10, 2025
Adds a new command, `nextstrain run`, to run (compatible) pathogen
workflows in a more managed way with easier update paths, without the
need for user-facing Git, with support for multiple versions, and with
support for concurrent-but-separate analyses via the same workflow.

Supported by changes to

  - `nextstrain setup` to obtain and set up specific versions of pathogens
  - `nextstrain update` to keep pathogens up-to-date
  - `nextstrain version` to report on pathogen versions available locally

At the moment, the only compatible pathogen is measles at my
not-yet-finished demo/prototype branch.¹  Avian flu should not be far
behind, though.

There's a lot of functionality (and polish) here and elsewhere still
todo to fully realize the sweeping goals of workflows-as-programs², but
this is a fully-usable first piece of the puzzle that can stand on its
own for now.

¹ <nextstrain/measles#55>
² <nextstrain/public#1>
tsibley added a commit to nextstrain/cli that referenced this pull request Mar 11, 2025
Adds a new command, `nextstrain run`, to run (compatible) pathogen
workflows in a more managed way with easier update paths, without the
need for user-facing Git, with support for multiple versions, and with
support for concurrent-but-separate analyses via the same workflow.

Supported by changes to

  - `nextstrain setup` to obtain and set up specific versions of pathogens
  - `nextstrain update` to keep pathogens up-to-date
  - `nextstrain version` to report on pathogen versions available locally

At the moment, the only compatible pathogen is measles at my
not-yet-finished demo/prototype branch.¹  Avian flu should not be far
behind, though.

There's a lot of functionality (and polish) here and elsewhere still
todo to fully realize the sweeping goals of workflows-as-programs², but
this is a fully-usable first piece of the puzzle that can stand on its
own for now.

¹ <nextstrain/measles#55>
² <nextstrain/public#1>
tsibley added a commit to nextstrain/cli that referenced this pull request Mar 11, 2025
Adds a new command, `nextstrain run`, to run (compatible) pathogen
workflows in a more managed way with easier update paths, without the
need for user-facing Git, with support for multiple versions, and with
support for concurrent-but-separate analyses via the same workflow.

Supported by changes to

  - `nextstrain setup` to obtain and set up specific versions of pathogens
  - `nextstrain update` to keep pathogens up-to-date
  - `nextstrain version` to report on pathogen versions available locally

At the moment, the only compatible pathogen is measles at my
not-yet-finished demo/prototype branch.¹  Avian flu should not be far
behind, though.

There's a lot of functionality (and polish) here and elsewhere still
todo to fully realize the sweeping goals of workflows-as-programs², but
this is a fully-usable first piece of the puzzle that can stand on its
own for now.

¹ <nextstrain/measles#55>
² <nextstrain/public#1>
tsibley added 2 commits March 26, 2025 12:50
It is very useful to be able to run the workflows outside of their own
source directories, e.g. to use Snakemake's --directory (-d) and/or
--snakefile (-s) options.  Doing so avoids polluting the source
directory with untracked custom inputs and workflow outputs and allows
them to be managed separately as a working analysis directory.  It also
allows multiple invocations of the same workflow to be run concurrently
by specifying different working analysis directories.

Implementing this portability requires that all source code-relative
paths are qualified with workflow.basedir or workflow.current_basedir as
appropriate.  Relative paths given in config are resolved by a new
helper function, resolve_config_path(), and end up qualified with the
working analysis directory (if the path exists there) or the workflow
source code directory (as a default file).  This allows a working
analysis directory to be populated before workflow invocation with files
that override defaults (e.g. annotations.tsv).

Explicitly resolving paths given in config seems best, as we're not able
to reliably resolve such paths automatically without tagging config
values.  YAML tagging seems less ideal for workflow users than requiring
workflow authors to use resolve_config_path().  Snakemake supports YAML
Template Engine (YTE) for config that could perhaps be used for path
resolution, but again, this ends up as a more complex config and
user-visible complication that seems best avoided.

Related-to: <nextstrain/public#1>
Declare compatibility with `nextstrain run`.

Related-to: <nextstrain/public#1>
Related-to: <nextstrain/cli#407>
@tsibley tsibley force-pushed the trs/workflows-as-programs branch from f0745a2 to 83b446d Compare March 26, 2025 20:11
@tsibley tsibley changed the title wip! workflows as programs prototyping Workflows-as-programs support Mar 26, 2025
@tsibley tsibley marked this pull request as ready for review March 26, 2025 20:12
@tsibley
Copy link
Copy Markdown
Contributor Author

tsibley commented Mar 26, 2025

Repushed with complete commit messages. No code changes. This is ready for merge as far as I'm concerned, but if folks have 11th hour suggestions after months of silence 🙃, I'll hear them out.

tsibley added a commit to nextstrain/cli that referenced this pull request Mar 27, 2025
Adds a new command, `nextstrain run`, to run (compatible) pathogen
workflows in a more managed way with easier update paths, without the
need for user-facing Git, with support for multiple versions, and with
support for concurrent-but-separate analyses via the same workflow.

Supported by changes to

  - `nextstrain setup` to obtain and set up specific versions of pathogens
  - `nextstrain update` to keep pathogens up-to-date
  - `nextstrain version` to report on pathogen versions available locally

At the moment, the only compatible pathogen is measles at my
not-yet-finished demo/prototype branch.¹  Avian flu should not be far
behind, though.

There's a lot of functionality (and polish) here and elsewhere still
todo to fully realize the sweeping goals of workflows-as-programs², but
this is a fully-usable first piece of the puzzle that can stand on its
own for now.

¹ <nextstrain/measles#55>
² <nextstrain/public#1>
tsibley added a commit to nextstrain/cli that referenced this pull request Mar 27, 2025
Adds a new command, `nextstrain run`, to run (compatible) pathogen
workflows in a more managed way with easier update paths, without the
need for user-facing Git, with support for multiple versions, and with
support for concurrent-but-separate analyses via the same workflow.

Supported by changes to

  - `nextstrain setup` to obtain and set up specific versions of pathogens
  - `nextstrain update` to keep pathogens up-to-date
  - `nextstrain version` to report on pathogen versions available locally

At the moment, the only compatible pathogen is measles at my
not-yet-finished demo/prototype branch.¹  Avian flu should not be far
behind, though.

There's a lot of functionality (and polish) here and elsewhere still
todo to fully realize the sweeping goals of workflows-as-programs², but
this is a fully-usable first piece of the puzzle that can stand on its
own for now.

¹ <nextstrain/measles#55>
² <nextstrain/public#1>
tsibley added a commit to nextstrain/cli that referenced this pull request Mar 27, 2025
Adds a new command, `nextstrain run`, to run (compatible) pathogen
workflows in a more managed way with easier update paths, without the
need for user-facing Git, with support for multiple versions, and with
support for concurrent-but-separate analyses via the same workflow.

Supported by changes to

  - `nextstrain setup` to obtain and set up specific versions of pathogens
  - `nextstrain update` to keep pathogens up-to-date
  - `nextstrain version` to report on pathogen versions available locally

At the moment, the only compatible pathogen is measles at my
not-yet-finished demo/prototype branch.¹  Avian flu should not be far
behind, though.

There's a lot of functionality (and polish) here and elsewhere still
todo to fully realize the sweeping goals of workflows-as-programs², but
this is a fully-usable first piece of the puzzle that can stand on its
own for now.

¹ <nextstrain/measles#55>
² <nextstrain/public#1>
Copy link
Copy Markdown
Member

@victorlin victorlin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skimmed changes, did not test, but looks good

@jameshadfield
Copy link
Copy Markdown
Member

No changes requested by me, but it would be good (after merge) for us to summarise the differences between this and what I implemented for avian-flu. The usage of wildcards (wildcard placeholders?) in config strings complicates things of course, so it'll never be identical, but skimming through the code:

  • measles' resolve_config_path special-cases "defaults/" whereas avian-flu doesn't, but otherwise the search order is the same
  • The code used to resolve paths is a little different (workflow.basedir vs os.path.normpath(os.path.join(workflow.current_basedir, "..")). I think the outcome is the same, this just reflects the file structure of measles vs avian-flu.
  • Measles has a compatibility: nextstrain run: true block in the nextstrain-pathogen.yaml whereas avian-flu does not. This didn't stop me running avian-flu with nextstrain run, but maybe it would have stopped me downloading/setting-up/installing/pulling/untaring it?
  • The custom_rules handling seems different? (Here's avian-flu)
  • Measles references scripts via {workflow.basedir}/<path> whereas I added a helper script(<path>) function. (Avian-flu has a lot more scripts than measles.)
  • Other changes I've missed?

@tsibley
Copy link
Copy Markdown
Contributor Author

tsibley commented Mar 28, 2025

Measles has a compatibility: nextstrain run: true block in the nextstrain-pathogen.yaml whereas avian-flu does not. This didn't stop me running avian-flu with nextstrain run, but maybe it would have stopped me downloading/setting-up/installing/pulling/untaring it?

Yes, its absence will make nextstrain setup fail checks. The idea is to make support for nextstrain run explicit.

(We can spell this another way in nextstrain-pathogen.yaml if folks have better ideas.)

The custom_rules handling seems different?

It is. IMO, what avian-flu does currently is not correct (or at least not very useful) for use with external analysis directories. See also extensive prior discussion.

Measles references scripts via {workflow.basedir}/<path> whereas I added a helper script(<path>) function. (Avian-flu has a lot more scripts than measles.)

I prefer less indirection when feasible, so that's what I did for measles. I do think there's room for variance between workflows here to accommodate different usability needs, unless we as a group really prefer to make everything reference scripts the same way.

That said, I don't think avian-flu needs script(). It's used 6 times. Measles references scripts 7 times. The diff to remove script() from avian-flu looks something like this.

Other changes I've missed?

Your list is a great start! I haven't closely compared them myself yet.

@tsibley tsibley merged commit efda442 into main Mar 28, 2025
5 checks passed
@tsibley tsibley deleted the trs/workflows-as-programs branch March 28, 2025 20:04
tsibley added a commit to nextstrain/cli that referenced this pull request Apr 2, 2025
Adds a new command, `nextstrain run`, to run (compatible) pathogen
workflows in a more managed way with easier update paths, without the
need for user-facing Git, with support for multiple versions, and with
support for concurrent-but-separate analyses via the same workflow.

Supported by changes to

  - `nextstrain setup` to obtain and set up specific versions of pathogens
  - `nextstrain update` to keep pathogens up-to-date
  - `nextstrain version` to report on pathogen versions available locally

At the moment, the only compatible pathogen is measles at my
not-yet-finished demo/prototype branch.¹  Avian flu should not be far
behind, though.

There's a lot of functionality (and polish) here and elsewhere still
todo to fully realize the sweeping goals of workflows-as-programs², but
this is a fully-usable first piece of the puzzle that can stand on its
own for now.

¹ <nextstrain/measles#55>
² <nextstrain/public#1>
tsibley added a commit to nextstrain/cli that referenced this pull request Apr 2, 2025
Adds a new command, `nextstrain run`, to run (compatible) pathogen
workflows in a more managed way with easier update paths, without the
need for user-facing Git, with support for multiple versions, and with
support for concurrent-but-separate analyses via the same workflow.

Supported by changes to

  - `nextstrain setup` to obtain and set up specific versions of pathogens
  - `nextstrain update` to keep pathogens up-to-date
  - `nextstrain version` to report on pathogen versions available locally

At the moment, the only compatible pathogen is measles at my
not-yet-finished demo/prototype branch.¹  Avian flu should not be far
behind, though.

There's a lot of functionality (and polish) here and elsewhere still
todo to fully realize the sweeping goals of workflows-as-programs², but
this is a fully-usable first piece of the puzzle that can stand on its
own for now.

¹ <nextstrain/measles#55>
² <nextstrain/public#1>
tsibley added a commit to nextstrain/cli that referenced this pull request Apr 7, 2025
Adds a new command, `nextstrain run`, to run (compatible) pathogen
workflows in a more managed way with easier update paths, without the
need for user-facing Git, with support for multiple versions, and with
support for concurrent-but-separate analyses via the same workflow.

Supported by changes to

  - `nextstrain setup` to obtain and set up specific versions of pathogens
  - `nextstrain update` to keep pathogens up-to-date
  - `nextstrain version` to report on pathogen versions available locally

At the moment, the only compatible pathogen is measles at my
not-yet-finished demo/prototype branch.¹  Avian flu should not be far
behind, though.

There's a lot of functionality (and polish) here and elsewhere still
todo to fully realize the sweeping goals of workflows-as-programs², but
this is a fully-usable first piece of the puzzle that can stand on its
own for now.

¹ <nextstrain/measles#55>
² <nextstrain/public#1>
tsibley added a commit to nextstrain/cli that referenced this pull request Apr 7, 2025
Adds a new command, `nextstrain run`, to run (compatible) pathogen
workflows in a more managed way with easier update paths, without the
need for user-facing Git, with support for multiple versions, and with
support for concurrent-but-separate analyses via the same workflow.

Supported by changes to

  - `nextstrain setup` to obtain and set up specific versions of pathogens
  - `nextstrain update` to keep pathogens up-to-date
  - `nextstrain version` to report on pathogen versions available locally

At the moment, the only compatible pathogen is measles at my
not-yet-finished demo/prototype branch.¹  Avian flu should not be far
behind, though.

There's a lot of functionality (and polish) here and elsewhere still
todo to fully realize the sweeping goals of workflows-as-programs², but
this is a fully-usable first piece of the puzzle that can stand on its
own for now.

¹ <nextstrain/measles#55>
² <nextstrain/public#1>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants