Skip to content

Latest commit

 

History

History
50 lines (44 loc) · 2.77 KB

File metadata and controls

50 lines (44 loc) · 2.77 KB

The Nordic Precision Medicine Initiative - Meeting No 5 - Reykjavìk, Iceland, 2018/06

Sarek, a portable workflow for WGS analysis of germline and somatic mutations

Maxime Garcia 123*, Szilveszter Juhos 123*, Malin Larsson 456, Teresita Díaz de Ståhl 13, Johanna Sandgren 13, Jesper Eisfeldt 73, Sebastian DiLorenzo 85A, Marcel Martin B5C, Pall Olason 95A, Phil Ewels B2C, Björn Nystedt 95A*, Monica Nistér 13, Max Käller 2D, *Corresponding Author

  1. Barntumörbanken, Dept. of Oncology Pathology;
  2. Science for Life Laboratory;
  3. Karolinska Institutet;
  4. Dept. of Physics, Chemistry and Biology;
  5. National Bioinformatics Infrastructure Sweden, Science for Life Laboratory;
  6. Linköping University;
  7. Clinical Genetics, Dept. of Molecular Medicine and Surgery;
  8. Dept. of Medical Sciences;
  9. Dept. of Cell and Molecular Biology; A. Uppsala University; B. Dept. of Biochemistry and Biophysics; C. Stockholm University; D. School of Biotechnology, Division of Gene Technology, Royal Institute of Technology

We present Sarek, a portable Open Source pipeline to resolve germline and somatic variants from WGS data: it is written in Nextflow, a domain-specific language for workflow building. It processes normal samples or normal/tumor pairs (with the option to include matched relapses).

Sarek is based on GATK best practices to prepare short-read data, which is done in parallel for a tumor/normal pair sample. After these preprocessing steps several variant callers scan the resulting BAM files: Manta for structural variants; Strelka and GATK HaplotypeCaller for germline variants; Freebayes, MuTect1, MuTect2 and Strelka for somatic variants; ASCAT to estimate sample heterogeneity, ploidy and CNVs. At the end of the analysis the resulting VCF files can be annotated by SNPEff and/or VEP to facilitate further downstream processing. Our ongoing effort focuses in filtering and prioritizing the annotated variants.

Sarek is based on Docker and Singularity containers, enabling version tracking, reproducibility and handling sensitive data. It is designed with flexible environments in mind, like running on a local fat node, a HTC cluster or in a cloud environment like AWS. The workflow is capable of accommodating further variant callers. Besides variant calls, the workflow provides quality controls presented by MultiQC. Checkpoints allow the software to be started from FastQ, BAM or VCF. Besides WGS data, it is capable to process inputs from WES or gene panels. The pipeline currently use GRCh37 or GRCh38 as a reference genome, it is also possible to add custom genomes. It has been successfully used to analyze more than two hundred WGS samples sent to National Genomics Infrastructure (Science for Life Laboratory) from different users. The MIT licensed Open Source code can be downloaded from GitHub.