Ancestral reconstruction bug fixes by jameshadfield · Pull Request #1975 · nextstrain/augur

jameshadfield · 2026-03-24T02:36:43Z

Fixes reconstruction bugs in augur ancestral, most notably when we would use the hardcoded 'N' character as the ambiguous state for AA sequences, and thus report erroneous mutations to N = Asn = Asparagine. See added tests for full details.

These bugs were noticed during refactoring as part of #1958, but they are not a result of the refactor. I've cherry-picked them into a new PR as they are unrelated to that work.

I tested on a 5.5k H3N2 HA dataset and there were no occurrences of these bugs implying that they may be rare. (The worst would need ambiguous ("X") states in the translated tip sequences, which nextclade may never produce??)

Relatedly, I've always wanted to check reconstructed translations against what the reconstructed nucleotide sequence would translate to. The final commit indicates mismatches here in an opt-in fashion.

joverlee521

Glad you noticed this bug! My main question is whether the sequence corrections should be centralized within run_ancestral.

augur/ancestral.py

joverlee521 · 2026-03-27T00:02:18Z

augur/ancestral.py

    """
    ancestral nucleotide reconstruction using TreeTime
    """
+    from treetime import TreeAnc


non-blocking

Is there a particular reason why this command has so many imports within functions instead of all at the top?

That style of imports was already present here and I actually prefer it because it will speed up augur startup if we used it everywhere. (Victor's profiled this somewhere and the majority of augur's start-up time is running all the imports across every augur subcommand.)

(Victor's profiled this somewhere and the majority of augur's start-up time is running all the imports across every augur subcommand.)

Ah right, this is discussed in #472 (comment).

augur/ancestral.py

tests/test_ancestral.py

joverlee521 · 2026-03-27T18:33:49Z

augur/ancestral.py

            raise AugurError("a reference Fasta is required with VCF-format alignments")
        compress_seq = read_vcf(args.alignment, args.vcf_reference)
-        aln = cast(VCF_Alignment, compress_seq['sequences'])
+        aln: VCF_Alignment | MultipleSeqAlignment = cast(VCF_Alignment, compress_seq['sequences'])


Does read_vcf DTRT with ambiguous characters so we don't need to correct it here?

Depends on your definition of DTRT! VCF files tend to suffer from missing/uncertain data being ignored and thus being reported the same as the reference allele. That's essentially what TreeTime does too - any non-ATGCN allele will just get dropped. This is a little different from the FASTA path which allows IUPAC ambiguity codes.

joverlee521 · 2026-03-27T18:45:16Z

augur/ancestral.py

Seeing all of the correct_* calls in this file and test_ancestral.py makes me think the corrections should be centralized within run_ancestral.

I originally wanted to do this because of the code ergonomics you allude to, but it led to running inference with uncorrected data and then correcting it afterwards (which is what we did, in a way, before this refactor). It's much simpler to correct all the data and then run inference. This is why the tests may feel a little awkward - I had originally written them expecting the correction to be done inside run_ancestral - but at the end of the day the main change is replacing MultipleSeqAlignment with a helper function corrected_alignment.

but it led to running inference with uncorrected data and then correcting it afterwards (which is what we did, in a way, before this refactor). It's much simpler to correct all the data and then run inference.

I think I'm missing something, why can't you correct data before inference within run_ancestral?

def run_ancestral(...): corrector = _make_seq_corrector(alphabet) corrected_aln = correct_alignment(aln, corrector) corrected_ref = corrector(ref) if ref else None tt = TreeAnc(tree=T, aln=corrected_aln, ref=corrected_ref...) tt.infer_ancestral_sequences(...)

Yeah, that works, but then you have to return the corrected_ref and make sure you use that (and not the original) for reconstruct_translations. I felt it was much cleaner to correct data as it was read. The only thing that's a little awkward is how the tests now look... I think they're ok but if you want to we can refactor them!

then you have to return the corrected_ref and make sure you use that (and not the original) for reconstruct_translations.

Ahh, this was the part that I overlooked. Agree then it seems cleaner to correct the data as it's being read.

The only thing that's a little awkward is how the tests now look... I think they're ok but if you want to we can refactor them!

The awkwardness can be avoided if these were moved from the unit tests to the functional Cram tests. There'a already a good number of ancestral cram tests that test the ambiguous site behavior.

In preparation for allowing protein-only reconstructions

Tests describe bugs relating to our handling of ambiguous states. The nucleotide bugs are minor, relating to handling of 'X' The AA bugs are bigger, as we hardcode "N" as the ambiguous state however N = Asn = Asparagine.

Previously we would reconstruct ~any sequence in TreeTime and then correct the resulting sequences via the `character_map`. This was error prone, both due to inconsistent application of these corrections as well as not distinguishing between alphabets correctly (see failing tests in parent commit). Here we invert the logic so that we fully correct alignments and reference sequences before inference. This means all states in the alignment / ref are valid; ambiguous states such as "R" (nuc) and "Z" (aa) are valid.

These arguments apply to both nuc & aa reconstructions

There's been a tension between the two ways we handle translations in augur: the "old" way of inferring ancestral seqs then translating them vs the "new" way of using nextclade to get translated tip-sequences and reconstructing nucleotide & aa seqs independently across the tree. The new way is arguable better, as nextclade has some alignment heuristics around CDSs to produce better translations, but opens the door to having mismatches between the inferred nuc seq and the corresponding AA residue. These differences would be surfaced in Auspice when looking at branch mutations.This commit adds an optional warning to check for such mismatches. A trial n=5.5k H3N2 HA dataset had the following differences: - SigPep: 2 terminal nodes. Median residue mismatch count: 13 (range: 1 - 13) - HA1: 2 terminal nodes and 4 internal nodes. Median residue mismatch count: 3 (range: 1 - 5) - HA2: 1 terminal node and 1 internal node were different. Median residue mismatch count: 2 (range: 1 - 2)

codecov · 2026-03-30T00:11:52Z

Codecov Report

❌ Patch coverage is 83.22981% with 27 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.50%. Comparing base (c7fb169) to head (a0c8197).
⚠️ Report is 2 commits behind head on master.

Files with missing lines	Patch %	Lines
augur/ancestral.py	82.91%	21 Missing and 6 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1975      +/-   ##
==========================================
+ Coverage   74.48%   74.50%   +0.02%     
==========================================
  Files          82       82              
  Lines        9124     9206      +82     
  Branches     1859     1870      +11     
==========================================
+ Hits         6796     6859      +63     
- Misses       2021     2038      +17     
- Partials      307      309       +2

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

joverlee521 · 2026-03-30T20:40:58Z

tests/test_ancestral.py

+        pos3_muts = [m for m in sample_c['muts'] if int(m[1:-1]) == 3]
+        assert len(pos3_muts)==1, "Expected only a single mutation (to Y)"
+        assert pos3_muts[0].endswith('Y'), "Expected ambiguous base Y to be reported as mutation"


which should be the only mutation on this branch

Should this be checking all mutations on this branch instead of filtering for position 3?

Suggested change

pos3_muts = [m for m in sample_c['muts'] if int(m[1:-1]) == 3]

assert len(pos3_muts)==1, "Expected only a single mutation (to Y)"

assert pos3_muts[0].endswith('Y'), "Expected ambiguous base Y to be reported as mutation"

assert len(sample_c['muts'])==1, "Expected only a single mutation (to Y)"

assert sample_c['muts'][0].endswith('Y'), "Expected ambiguous base Y to be reported as mutation"

There's always a question of what to test - and I'm not sure of the answer! - we can test the entire inference result or we can test the specific mutation the test is designed to expose. I went with the latter, but open to other points of view.

Hah, I'm only asking because I wasn't sure based on the comment that it "should be the only mutation on this branch". If that means it should be the only mutation at position 3 on this branch, then fine to leave it.

jameshadfield added priority: high To be resolved before other issues and removed priority: high To be resolved before other issues labels Mar 24, 2026

joverlee521 self-requested a review March 26, 2026 22:28

joverlee521 reviewed Mar 27, 2026

View reviewed changes

[ancestral] refactor, add types

28ade1a

In preparation for allowing protein-only reconstructions

jameshadfield force-pushed the james/ancestral-bug-fixes branch from 754c44e to fd56919 Compare March 29, 2026 23:44

jameshadfield added 4 commits March 30, 2026 12:52

[ancestral] add tests to describe reconstruction bugs

9654830

Tests describe bugs relating to our handling of ambiguous states. The nucleotide bugs are minor, relating to handling of 'X' The AA bugs are bigger, as we hardcode "N" as the ambiguous state however N = Asn = Asparagine.

[ancestral] update argument messaging

bae3bac

These arguments apply to both nuc & aa reconstructions

jameshadfield force-pushed the james/ancestral-bug-fixes branch from fd56919 to a0c8197 Compare March 29, 2026 23:52

joverlee521 reviewed Mar 30, 2026

View reviewed changes

jameshadfield mentioned this pull request Apr 1, 2026

allow protein analyses #1958

Open

4 tasks

Conversation

jameshadfield commented Mar 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

joverlee521 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jameshadfield Mar 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

codecov bot commented Mar 30, 2026

Codecov Report

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jameshadfield commented Mar 24, 2026 •

edited

Loading

jameshadfield Mar 29, 2026 •

edited

Loading