-
Notifications
You must be signed in to change notification settings - Fork 21
Description
I have a naive question about your MSA methodology. Why would the aligner choose to align short segments (<20 bp) at the start of the alignment and then allow a large deletion (>1000bp) before aligning again? Perhaps I am misunderstanding the parameters to abPOA.
Example: I have a 100 homologous sequences -- a few full length (~2200 bp) and many fragments (~300 bp in length). I used local alignment as there is always a chance in these datasets that small portions of the edges of these sequences could be unrelated sequence, although in this dataset it doesn't look to be the case. I ran abPOA as:
../bin/abpoa family-seqs.fa -t comparison.matrix -O 25,0 -E 5,0 -m 1 -r 1
Where the comparison matrix is simply:
A G C T N
A 9 -6 -15 -17 -1
G -6 10 -15 -15 -1
C -15 -15 10 -6 -1
T -17 -15 -6 9 -1
N -1 -1 -1 -1 -1
Here is the start of the MSA with the first 11 sequences anchoring with minor or no overlap to other sequences before entering a large deletion:
gi|526567 1 AAGC------------------------------------------------------------------------------------------------ 4 [1]
gi|527347 1 GC------------------------------------------------------------------------------------------------ 2 [2]
gi|526255 1 ATGACA---------------------T-------------------------------------------------------------------- 7 [3]
gi|526360 1 CA---------------------T-------------------------------------------------------------------- 3 [4]
gi|527067 1 TG---------------------------------------------------------------------------------------- 2 [5]
gi|527527 1 ATA------------------------------------------------------------------------------------- 3 [6]
gi|527416 1 C------------------------------------------------------------------------------------ 1 [7]
gi|526343 1 CCACCTGGAAG------------------------------------------------------------------------- 11 [8]
gi|527241 1 TGCTT-------------------------------------------------------------------- 5 [9]
gi|530654 1 CTA----------------------------------------------------------------- 3 [10]
gi|525985 1 TACAT------------------------------------------------------------ 5 [11]
gi|525696 1 CCTGTCTTCTACCATCTCCGAACTCCAGATACTCCAACAGCGAAAGGGATCTGGGCCCAA 60 [12]
gi|525698 1 CCTGTCTTCCACCATCTCCGAACTCCAGATACTCCTACAGCGAAAGGGATCTGGGCCCAA 60 [13]
gi|525697 1 CCTGTCTTCCACCATCTCTGAACTCCAGACACTCCAACAGTGAAAGGGATCTAGGGCCAC 60 [14]
gi|525699 1 CCTGTCTTCCACCATCTCTGAACTCCAGACACTCCAACAGTGAAAGGGATCTAGGGCCAC 60 [15]
The gap in the first 11 sequences goes on for about 1000 bp in the alignment before starting backup up again:
gi|526567 5 --------------------------------------------------------------------------------ATAT------AGAG-T--T- 14 [1]
gi|527347 3 --------------------------------------------------------------------------------ATGT------CAA-----T- 10 [2]
gi|526255 8 -----------------------------------------------------------------------------------TAT---GAG------T- 14 [3]
gi|526360 4 -----------------------------------------------------------------------------------TGT---GGGG-----C- 11 [4]
gi|527067 3 -------------------------------------------------------------------------------TGTATAT-----AAT-G--T- 14 [5]
gi|527527 4 ------------------------------------------------------------------------AA------ATGT------AAA------- 12 [6]
gi|527416 2 --------------------------------------------------------------------------------GTGA------AAGTA----- 10 [7]
gi|526343 12 ---------------------------------------------------------------------------------------CCAAAGT-TCATG 23 [8]
gi|527241 6 -----------------------------------------------------------------------------------T------AAAT-T--T- 12 [9]
gi|530654 3 ---------------------------------------------------------------------------------------------------- 3 [10]
gi|525985 6 ------------------GGCAGAGGAAAAAGTG-CACAAATTGGGACCAGAAAAT---ATAA---A--GGGGAG---TTGTGT------AGAT-T--T- 65 [11]
gi|525696 1040 CAGATGGTTAAATGGGAGGACAGAAAATGAAT-G-CACAAG--TGGACCAATAAAT-GAATGATCCATTGGGAAGCATCTGTGCAT---GAAAT-C--T- 1127 [12]
gi|525698 1040 CAGATGGTTAAATGGGAGGGCAGAAAATGAAT-G-CACAAG--TGGATCTATAAAT-GAATGATCCATTGGGAAGCATCTGTGCAT---GAAAT-C--T- 1127 [13]
gi|525697 1042 CAGATGGTTAAATGGGAGGGCAGAAAATGAAT-G-CACAAG--TGGACCAATAAAT-GAATGATCCATTGGGAAGCATCTGTGTAT---GAAAT-C--T- 1129 [14]
gi|525699 1040 CAGATGGTTAAATGGGAGGGCAGAAAATGTAG-G-CACAAG--GGGACCAATAAAT-GAATGATCTATTGAGAAGCATCTGTGCAT---GAAAT-C--T- 1127 [15]
gi|525700 749 CAGATGGGTAAATGCGAGGGCAGAGAATGAAT-G-CACAAG--GGTACCAATAAAT-GAATGATCCATTGGGAAGCATCTGTGCAC---CAAAT-C--TG 837 [16]
Certainly there was enough opportunity to align the first two bases of gi|527347 ('GC') somewhere much closer. Perhaps my affine gap parameters (values I use typically in pairwise alignment) are not stringent enough in this context? Any other ideas?