-
Notifications
You must be signed in to change notification settings - Fork 526
Closed
Labels
bugSomething isn't workingSomething isn't working
Description
Hi all, I encountered some unexpected results when testing the GATK joint genotype calling pipeline on my dataset.
- The number of variants in the final recalibrated VCF (
joint_germline_recalibrated.vcf.gz) is exactly double that in the raw VCF (joint_germline.vcf.gz) outputted from GenotypeGVCFs. This is because all variants are completely duplicated due to merging separately recalibrated SNP and indel VCFs, the entries are also not identical as shown below. The fix is to sequentially recalibrate for SNPs then indels, or vice versa, on the same VCF.
Raw VCF counts and example entries:
$ bcftools +counts joint_germline.vcf.gz
Number of samples: 15
Number of SNPs: 13717272
Number of INDELs: 3503929
Number of MNPs: 0
Number of others: 0
Number of sites: 17113477
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
1 10146 rs375931351 AC A 102.67 . AC=3;AF=0.1;AN=30;BaseQRankSum=0.967;DB;DP=698;ExcessHet=0;FS=0;InbreedingCoeff=0.5022;MLEAC=2;MLEAF=0.067;MQ=40.97;MQRankSum=0.967;QD=14.67;ReadPosRankSum=0.967;SOR=0.223 GT:AD:DP:GQ:PL
1 10327 rs112750067 T C 106.77 . AC=2;AF=0.067;AN=30;DB;DP=280;ExcessHet=0;FS=0;InbreedingCoeff=0.214;MLEAC=4;MLEAF=0.133;MQ=30.71;QD=25.36;SOR=2.833 GT:AD:DP:GQ:PGT:PID:PL:PS
Recalibrated VCF counts and example entries:
$ bcftools +counts joint_germline_recalibrated.vcf.gz
Number of samples: 15
Number of SNPs: 27434544
Number of INDELs: 7007858
Number of MNPs: 0
Number of others: 0
Number of sites: 34226954
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
1 10146 rs375931351 AC A 102.67 PASS AC=3;AF=0.1;AN=30;BaseQRankSum=0.967;DB;DP=698;ExcessHet=0;FS=0;InbreedingCoeff=0.5022;MLEAC=2;MLEAF=0.067;MQ=40.97;MQRankSum=0.967;NEGATIVE_TRAIN_SITE;POSITIVE_TRAIN_SITE;QD=14.67;ReadPosRankSum=0.967;SOR=0.223;VQSLOD=-9.728e-01;culprit=DP GT:AD:DP:GQ:PL
1 10146 rs375931351 AC A 102.67 . AC=3;AF=0.1;AN=30;BaseQRankSum=0.967;DB;DP=698;ExcessHet=0;FS=0;InbreedingCoeff=0.5022;MLEAC=2;MLEAF=0.067;MQ=40.97;MQRankSum=0.967;QD=14.67;ReadPosRankSum=0.967;SOR=0.223 GT:AD:DP:GQ:PL
1 10327 rs112750067 T C 106.77 . AC=2;AF=0.067;AN=30;DB;DP=280;ExcessHet=0;FS=0;InbreedingCoeff=0.214;MLEAC=4;MLEAF=0.133;MQ=30.71;QD=25.36;SOR=2.833 GT:AD:DP:GQ:PGT:PID:PL:PS
1 10327 rs112750067 T C 106.77 VQSRTrancheSNP99.90to100.00 AC=2;AF=0.067;AN=30;DB;DP=280;ExcessHet=0;FS=0;InbreedingCoeff=0.214;MLEAC=4;MLEAF=0.133;MQ=30.71;NEGATIVE_TRAIN_SITE;POSITIVE_TRAIN_SITE;QD=25.36;SOR=2.833;VQSLOD=-1.242e+00;culprit=SOR GT:AD:DP:GQ:PGT:PID:PL:PS
- I used the
--mergeannotation method and the final annotated VCF (joint_germline_recalibrated_snpEff_VEP.ann.vcf.gz) is missing almost half of the variants compared to the un-annotated VCFs, which was pretty alarming. After tracing the changes to the VCF it appears that VEP is intentionally filtering out common variants via the--filter_commonexternal argument, which removes alleles with global AF > 0.01. While this is not a bug per-se, it would be helpful to make this an optional argument or at least document it somewhere to prevent scares and confusion.
Variant counts following through the annotation pipeline:
$ bcftools +counts joint_germline_snpEff.ann.vcf.gz
Number of samples: 15
Number of SNPs: 13717272
Number of INDELs: 3503929
Number of MNPs: 0
Number of others: 0
Number of sites: 17113477
$ bcftools +counts joint_germline_snpEff_VEP.ann.vcf.gz
Number of samples: 15
Number of SNPs: 5954775
Number of INDELs: 3370807
Number of MNPs: 0
Number of others: 0
Number of sites: 9223960
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
bugSomething isn't workingSomething isn't working