Skip to content

Joint Genotype Calling Recalibrated VCFs Duplicate Entries; VEP is (Intentionally) Eating Variants  #966

@evanwu1119

Description

@evanwu1119

Hi all, I encountered some unexpected results when testing the GATK joint genotype calling pipeline on my dataset.

  1. The number of variants in the final recalibrated VCF (joint_germline_recalibrated.vcf.gz) is exactly double that in the raw VCF (joint_germline.vcf.gz) outputted from GenotypeGVCFs. This is because all variants are completely duplicated due to merging separately recalibrated SNP and indel VCFs, the entries are also not identical as shown below. The fix is to sequentially recalibrate for SNPs then indels, or vice versa, on the same VCF.

Raw VCF counts and example entries:

$ bcftools +counts joint_germline.vcf.gz
Number of samples: 15
Number of SNPs:    13717272
Number of INDELs:  3503929
Number of MNPs:    0
Number of others:  0
Number of sites:   17113477

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT
1       10146   rs375931351     AC      A       102.67  .       AC=3;AF=0.1;AN=30;BaseQRankSum=0.967;DB;DP=698;ExcessHet=0;FS=0;InbreedingCoeff=0.5022;MLEAC=2;MLEAF=0.067;MQ=40.97;MQRankSum=0.967;QD=14.67;ReadPosRankSum=0.967;SOR=0.223     GT:AD:DP:GQ:PL
1       10327   rs112750067     T       C       106.77  .       AC=2;AF=0.067;AN=30;DB;DP=280;ExcessHet=0;FS=0;InbreedingCoeff=0.214;MLEAC=4;MLEAF=0.133;MQ=30.71;QD=25.36;SOR=2.833    GT:AD:DP:GQ:PGT:PID:PL:PS

Recalibrated VCF counts and example entries:

$ bcftools +counts joint_germline_recalibrated.vcf.gz
Number of samples: 15
Number of SNPs:    27434544
Number of INDELs:  7007858
Number of MNPs:    0
Number of others:  0
Number of sites:   34226954

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT
1       10146   rs375931351     AC      A       102.67  PASS    AC=3;AF=0.1;AN=30;BaseQRankSum=0.967;DB;DP=698;ExcessHet=0;FS=0;InbreedingCoeff=0.5022;MLEAC=2;MLEAF=0.067;MQ=40.97;MQRankSum=0.967;NEGATIVE_TRAIN_SITE;POSITIVE_TRAIN_SITE;QD=14.67;ReadPosRankSum=0.967;SOR=0.223;VQSLOD=-9.728e-01;culprit=DP        GT:AD:DP:GQ:PL
1       10146   rs375931351     AC      A       102.67  .       AC=3;AF=0.1;AN=30;BaseQRankSum=0.967;DB;DP=698;ExcessHet=0;FS=0;InbreedingCoeff=0.5022;MLEAC=2;MLEAF=0.067;MQ=40.97;MQRankSum=0.967;QD=14.67;ReadPosRankSum=0.967;SOR=0.223     GT:AD:DP:GQ:PL
1       10327   rs112750067     T       C       106.77  .       AC=2;AF=0.067;AN=30;DB;DP=280;ExcessHet=0;FS=0;InbreedingCoeff=0.214;MLEAC=4;MLEAF=0.133;MQ=30.71;QD=25.36;SOR=2.833    GT:AD:DP:GQ:PGT:PID:PL:PS
1       10327   rs112750067     T       C       106.77  VQSRTrancheSNP99.90to100.00     AC=2;AF=0.067;AN=30;DB;DP=280;ExcessHet=0;FS=0;InbreedingCoeff=0.214;MLEAC=4;MLEAF=0.133;MQ=30.71;NEGATIVE_TRAIN_SITE;POSITIVE_TRAIN_SITE;QD=25.36;SOR=2.833;VQSLOD=-1.242e+00;culprit=SOR      GT:AD:DP:GQ:PGT:PID:PL:PS
  1. I used the --merge annotation method and the final annotated VCF (joint_germline_recalibrated_snpEff_VEP.ann.vcf.gz) is missing almost half of the variants compared to the un-annotated VCFs, which was pretty alarming. After tracing the changes to the VCF it appears that VEP is intentionally filtering out common variants via the --filter_common external argument, which removes alleles with global AF > 0.01. While this is not a bug per-se, it would be helpful to make this an optional argument or at least document it somewhere to prevent scares and confusion.

Variant counts following through the annotation pipeline:

$ bcftools +counts joint_germline_snpEff.ann.vcf.gz
Number of samples: 15
Number of SNPs:    13717272
Number of INDELs:  3503929
Number of MNPs:    0
Number of others:  0
Number of sites:   17113477

$ bcftools +counts joint_germline_snpEff_VEP.ann.vcf.gz
Number of samples: 15
Number of SNPs:    5954775
Number of INDELs:  3370807
Number of MNPs:    0
Number of others:  0
Number of sites:   9223960

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions