Skip to content

MSA anchoring #70

@rmhubley

Description

@rmhubley

I have a naive question about your MSA methodology. Why would the aligner choose to align short segments (<20 bp) at the start of the alignment and then allow a large deletion (>1000bp) before aligning again? Perhaps I am misunderstanding the parameters to abPOA.

Example: I have a 100 homologous sequences -- a few full length (~2200 bp) and many fragments (~300 bp in length). I used local alignment as there is always a chance in these datasets that small portions of the edges of these sequences could be unrelated sequence, although in this dataset it doesn't look to be the case. I ran abPOA as:

../bin/abpoa family-seqs.fa -t comparison.matrix -O 25,0 -E 5,0  -m 1 -r 1

Where the comparison matrix is simply:

    A   G   C   T   N      
A   9  -6 -15 -17  -1     
G  -6  10 -15 -15  -1     
C -15 -15  10  -6  -1     
T -17 -15  -6   9  -1     
N  -1  -1  -1  -1  -1     

Here is the start of the MSA with the first 11 sequences anchoring with minor or no overlap to other sequences before entering a large deletion:

gi|526567    1 AAGC------------------------------------------------------------------------------------------------    4 [1]
gi|527347    1   GC------------------------------------------------------------------------------------------------    2 [2]
gi|526255    1     ATGACA---------------------T--------------------------------------------------------------------    7 [3]
gi|526360    1         CA---------------------T--------------------------------------------------------------------    3 [4]
gi|527067    1           TG----------------------------------------------------------------------------------------    2 [5]
gi|527527    1             ATA-------------------------------------------------------------------------------------    3 [6]
gi|527416    1                C------------------------------------------------------------------------------------    1 [7]
gi|526343    1                 CCACCTGGAAG-------------------------------------------------------------------------    11 [8]
gi|527241    1                            TGCTT--------------------------------------------------------------------    5 [9]
gi|530654    1                                 CTA-----------------------------------------------------------------    3 [10]
gi|525985    1                                    TACAT------------------------------------------------------------    5 [11]
gi|525696    1                                         CCTGTCTTCTACCATCTCCGAACTCCAGATACTCCAACAGCGAAAGGGATCTGGGCCCAA    60 [12]
gi|525698    1                                         CCTGTCTTCCACCATCTCCGAACTCCAGATACTCCTACAGCGAAAGGGATCTGGGCCCAA    60 [13]
gi|525697    1                                         CCTGTCTTCCACCATCTCTGAACTCCAGACACTCCAACAGTGAAAGGGATCTAGGGCCAC    60 [14]
gi|525699    1                                         CCTGTCTTCCACCATCTCTGAACTCCAGACACTCCAACAGTGAAAGGGATCTAGGGCCAC    60 [15]

The gap in the first 11 sequences goes on for about 1000 bp in the alignment before starting backup up again:

gi|526567    5 --------------------------------------------------------------------------------ATAT------AGAG-T--T-    14 [1]
gi|527347    3 --------------------------------------------------------------------------------ATGT------CAA-----T-    10 [2]
gi|526255    8 -----------------------------------------------------------------------------------TAT---GAG------T-    14 [3]
gi|526360    4 -----------------------------------------------------------------------------------TGT---GGGG-----C-    11 [4]
gi|527067    3 -------------------------------------------------------------------------------TGTATAT-----AAT-G--T-    14 [5]
gi|527527    4 ------------------------------------------------------------------------AA------ATGT------AAA-------    12 [6]
gi|527416    2 --------------------------------------------------------------------------------GTGA------AAGTA-----    10 [7]
gi|526343   12 ---------------------------------------------------------------------------------------CCAAAGT-TCATG    23 [8]
gi|527241    6 -----------------------------------------------------------------------------------T------AAAT-T--T-    12 [9]
gi|530654    3 ----------------------------------------------------------------------------------------------------    3 [10]
gi|525985    6 ------------------GGCAGAGGAAAAAGTG-CACAAATTGGGACCAGAAAAT---ATAA---A--GGGGAG---TTGTGT------AGAT-T--T-    65 [11]
gi|525696 1040 CAGATGGTTAAATGGGAGGACAGAAAATGAAT-G-CACAAG--TGGACCAATAAAT-GAATGATCCATTGGGAAGCATCTGTGCAT---GAAAT-C--T-    1127 [12]
gi|525698 1040 CAGATGGTTAAATGGGAGGGCAGAAAATGAAT-G-CACAAG--TGGATCTATAAAT-GAATGATCCATTGGGAAGCATCTGTGCAT---GAAAT-C--T-    1127 [13]
gi|525697 1042 CAGATGGTTAAATGGGAGGGCAGAAAATGAAT-G-CACAAG--TGGACCAATAAAT-GAATGATCCATTGGGAAGCATCTGTGTAT---GAAAT-C--T-    1129 [14]
gi|525699 1040 CAGATGGTTAAATGGGAGGGCAGAAAATGTAG-G-CACAAG--GGGACCAATAAAT-GAATGATCTATTGAGAAGCATCTGTGCAT---GAAAT-C--T-    1127 [15]
gi|525700  749 CAGATGGGTAAATGCGAGGGCAGAGAATGAAT-G-CACAAG--GGTACCAATAAAT-GAATGATCCATTGGGAAGCATCTGTGCAC---CAAAT-C--TG    837 [16]

Certainly there was enough opportunity to align the first two bases of gi|527347 ('GC') somewhere much closer. Perhaps my affine gap parameters (values I use typically in pairwise alignment) are not stringent enough in this context? Any other ideas?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions