Skip to content

Commit f28a546

Browse files
authored
Merge pull request #145 from nibscles/umi
Add UMI reads processing capability
2 parents 1c9454c + 47c7bfc commit f28a546

File tree

5 files changed

+190
-10
lines changed

5 files changed

+190
-10
lines changed

CHANGELOG.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -8,6 +8,8 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/) a
88

99
### Added
1010

11+
- [#145](https://github.com/nf-core/sarek/pull/145) - Add `UMI annotation and consensus` functionality to Sarek
12+
1113
### Changed
1214

1315
- [#237](https://github.com/nf-core/sarek/pull/237) - Switch `bwa 0.7.17` for `bwa-mem2 2.0`

conf/test_umi.config

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,6 @@
1+
process{
2+
withName:UMIMapBamFile {
3+
cpus = 2
4+
memory = 8.GB
5+
}
6+
}

docs/usage.md

Lines changed: 18 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -37,6 +37,9 @@
3737
- [--cf_window](#--cf_window)
3838
- [--no_gvcf](#--no_gvcf)
3939
- [--no_strelka_bp](#--no_strelka_bp)
40+
- [--umi](#--umi)
41+
- [--read_structure1](#--read_structure1)
42+
- [--read_structure2](#--read_structure2)
4043
- [--pon](#--pon)
4144
- [--pon_index](#--pon_index)
4245
- [Annotation](#annotation)
@@ -478,6 +481,21 @@ Path to CADD SNVs index.
478481

479482
Enable genesplicer within VEP.
480483

484+
### --umi
485+
486+
If provided, UMIs steps will be run to extract and annotate the reads with UMIs and create consensus reads: this part of the pipeline uses *FGBIO* to convert the fastq files into a unmapped BAM, where reads are tagged with the UMIs extracted from the fastq sequences. In order to allow the correct tagging, the UMI sequence must be contained in the read sequence itself, and not in the FASTQ name.
487+
Following this step, the uBam is aligned and reads are then grouped based on mapping position and UMI tag.
488+
Finally, reads in the same groups are collapsed to create a consensus read. To create consensus, we have chosen to use the *adjacency method* [ref](https://cgatoxford.wordpress.com/2015/08/14/unique-molecular-identifiers-the-problem-the-solution-and-the-proof/).
489+
In order for the correct tagging to be performed, a read structure needs to be specified as indicated below.
490+
491+
### --read_structure1
492+
493+
When processing UMIs, a read structure should always be provided for each of the fastq files, to allow the correct annotation of the bam file. If the read does not contain any UMI, the structure will be +T (i.e. only template of any length). The read structure follows a format adopted by different tools, and described [here](https://github.com/fulcrumgenomics/fgbio/wiki/Read-Structures)
494+
495+
### --read_structure2
496+
497+
When processing UMIs, a read structure should always be provided for each of the fastq files, to allow the correct annotation of the bam file. If the read does not contain any UMI, the structure will be +T (i.e. only template of any length). The read structure follows a format adopted by different tools, and described [here](https://github.com/fulcrumgenomics/fgbio/wiki/Read-Structures)
498+
481499
## Reference genomes
482500

483501
The pipeline config files come bundled with paths to the Illumina iGenomes reference index files.

environment.yml

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -17,6 +17,7 @@ dependencies:
1717
- bioconda::control-freec=11.5
1818
- bioconda::ensembl-vep=99.2
1919
- bioconda::fastqc=0.11.9
20+
- bioconda::fgbio=1.1.0
2021
- bioconda::freebayes=1.3.2
2122
- bioconda::gatk4-spark=4.1.7.0
2223
- bioconda::genesplicer=1.0
@@ -25,6 +26,7 @@ dependencies:
2526
- bioconda::msisensor=0.5
2627
- bioconda::multiqc=1.8
2728
- bioconda::qualimap=2.2.2d
29+
- bioconda::samblaster=0.1.24
2830
- bioconda::samtools=1.9
2931
- bioconda::snpeff=4.3.1t
3032
- bioconda::strelka=2.9.10
@@ -34,3 +36,4 @@ dependencies:
3436
- bioconda::vcftools=0.1.16
3537
- conda-forge::pigz=2.3.4
3638
- conda-forge::r-ggplot2=3.3.0
39+

main.nf

Lines changed: 161 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -97,6 +97,11 @@ def helpMessage() {
9797
--ignore_soft_clipped_bases [bool] Do not analyze soft clipped bases in the reads for GATK Mutect2
9898
--no_gvcf [bool] No g.vcf output from GATK HaplotypeCaller
9999
--no_strelka_bp [bool] Will not use Manta candidateSmallIndels for Strelka (not recommended by Best Practices)
100+
--umi [bool] If provided, UMIs steps will be run to extract and annotate the reads with UMI and create consensus reads
101+
--read_structure1 [string] When processing UMIs, a read structure should always be provided for each of the fastq files. If the read does not contain any UMI, the structure will be +T (i.e. only template of any length).
102+
See: https://github.com/fulcrumgenomics/fgbio/wiki/Read-Structures
103+
--read_structure2 [string] When processing UMIs, a read structure should always be provided for each of the fastq files. If the read does not contain any UMI, the structure will be +T (i.e. only template of any length).
104+
See: https://github.com/fulcrumgenomics/fgbio/wiki/Read-Structures
100105
--pon [file] Panel-of-normals VCF (bgzipped) for GATK Mutect2 / Sentieon TNscope
101106
See: https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_walkers_mutect_CreateSomaticPanelOfNormals.php
102107
--pon_index [file] Index of pon panel-of-normals VCF
@@ -1068,7 +1073,10 @@ if (params.split_fastq){
10681073

10691074
inputPairReads = inputPairReads.dump(tag:'INPUT')
10701075

1071-
(inputPairReads, inputPairReadsTrimGalore, inputPairReadsFastQC) = inputPairReads.into(3)
1076+
(inputPairReads, inputPairReadsTrimGalore, inputPairReadsFastQC, inputPairReadsUMI) = inputPairReads.into(4)
1077+
if(params.umi) inputPairReads.close()
1078+
else inputPairReadsUMI.close()
1079+
10721080

10731081
// STEP 0.5: QC ON READS
10741082

@@ -1178,18 +1186,161 @@ process TrimGalore {
11781186
"""
11791187
}
11801188

1181-
if (!params.trim_fastq) inputPairReadsTrimGalore.close()
1189+
/*
1190+
================================================================================
1191+
UMIs PROCESSING
1192+
================================================================================
1193+
*/
1194+
1195+
// UMI - STEP 1 - ANNOTATE
1196+
// the process needs to convert fastq to unmapped bam
1197+
// and while doing the conversion, tag the bam field RX with the UMI sequence
1198+
1199+
process UMIFastqToBAM {
1200+
1201+
publishDir "${params.outdir}/Reports/${idSample}/UMI/${idSample}_${idRun}", mode: params.publishDirMode
1202+
1203+
input:
1204+
set idPatient, idSample, idRun, file("${idSample}_${idRun}_R1.fastq.gz"), file("${idSample}_${idRun}_R2.fastq.gz") from inputPairReadsUMI
1205+
1206+
output:
1207+
tuple val(idPatient), val(idSample), val(idRun), file("${idSample}_umi_converted.bam") into umi_converted_bams_ch
1208+
1209+
when: params.umi && params.read_structure1 && params.read_structure2
1210+
1211+
1212+
// tmp folder for fgbio might be solved more elengantly?
1213+
1214+
script:
1215+
"""
1216+
mkdir tmpFolder
1217+
1218+
fgbio --tmp-dir=${PWD}/tmpFolder \
1219+
FastqToBam \
1220+
-i "${idSample}_${idRun}_R1.fastq.gz" "${idSample}_${idRun}_R2.fastq.gz" \
1221+
-o "${idSample}_umi_converted.bam" \
1222+
--read-structures $params.read_structure1 $params.read_structure2 \
1223+
--sample $idSample \
1224+
--library $idSample
1225+
"""
1226+
1227+
}
1228+
1229+
// UMI - STEP 2 - MAP THE BAM FILE
1230+
// this is necessary because the UMI groups are created based on
1231+
// mapping position + same UMI tag
1232+
1233+
process UMIMapBamFile {
1234+
1235+
input:
1236+
set idPatient, idSample, idRun, file(convertedBam) from umi_converted_bams_ch
1237+
file(bwaIndex) from ch_bwa
1238+
file(fasta) from ch_fasta
1239+
file(fastaFai) from ch_fai
1240+
1241+
output:
1242+
tuple val(idPatient), val(idSample), val(idRun), file("${idSample}_umi_unsorted.bam") into umi_aligned_bams_ch
1243+
1244+
when: params.umi && params.read_structure1 && params.read_structure2
1245+
1246+
script:
1247+
"""
1248+
samtools bam2fq -T RX ${convertedBam} | \
1249+
bwa mem -p -t ${task.cpus} -C -M -R \"@RG\\tID:${idSample}\\tSM:${idSample}\\tPL:Illumina\" \
1250+
${fasta} - | \
1251+
samtools view -bS - > ${idSample}_umi_unsorted.bam
1252+
"""
1253+
1254+
}
1255+
1256+
// UMI - STEP 3 - GROUP READS BY UMIs
1257+
// We have chose the Adjacency method, following the nice paper and blog explanation integrated in both
1258+
// UMItools and FGBIO
1259+
// https://cgatoxford.wordpress.com/2015/08/14/unique-molecular-identifiers-the-problem-the-solution-and-the-proof/
1260+
// alternatively we can define this as input for the user to choose from
1261+
1262+
process GroupReadsByUmi {
1263+
1264+
publishDir "${params.outdir}/Reports/${idSample}/UMI/${idSample}_${idRun}", mode: params.publishDirMode
1265+
1266+
input:
1267+
set idPatient, idSample, idRun, file(alignedBam) from umi_aligned_bams_ch
1268+
1269+
output:
1270+
file("${idSample}_umi_histogram.txt") into umi_histogram_ch
1271+
tuple val(idPatient), val(idSample), val(idRun), file("${idSample}_umi-grouped.bam") into umi_grouped_bams_ch
1272+
1273+
when: params.umi && params.read_structure1 && params.read_structure2
1274+
1275+
script:
1276+
"""
1277+
mkdir tmpFolder
1278+
1279+
samtools view -h $alignedBam | \
1280+
samblaster -M --addMateTags | \
1281+
samtools view -Sb - >${idSample}_unsorted_tagged.bam
1282+
1283+
fgbio --tmp-dir=${PWD}/tmpFolder \
1284+
GroupReadsByUmi \
1285+
-s Adjacency \
1286+
-i ${idSample}_unsorted_tagged.bam \
1287+
-o ${idSample}_umi-grouped.bam \
1288+
-f ${idSample}_umi_histogram.txt
1289+
"""
1290+
1291+
}
1292+
1293+
// UMI - STEP 4 - CALL MOLECULAR CONSENSUS
1294+
// Now that the reads are organised by UMI groups a molecular consensus will be created
1295+
// the resulting bam file will be again unmapped and therefore can be fed into the
1296+
// existing workflow from the step mapping
1297+
1298+
process CallMolecularConsensusReads {
1299+
1300+
publishDir "${params.outdir}/Reports/${idSample}/UMI/${idSample}_${idRun}", mode: params.publishDirMode
1301+
1302+
input:
1303+
set idPatient, idSample, idRun, file(groupedBamFile) from umi_grouped_bams_ch
1304+
1305+
output:
1306+
tuple val(idPatient), val(idSample), val(idRun), file("${idSample}_umi-consensus.bam") into consensus_bam_ch
1307+
1308+
when: params.umi && params.read_structure1 && params.read_structure2
1309+
1310+
script:
1311+
"""
1312+
mkdir tmpFolder
1313+
1314+
fgbio --tmp-dir=${PWD}/tmpFolder \
1315+
CallMolecularConsensusReads \
1316+
-i $groupedBamFile \
1317+
-o ${idSample}_umi-consensus.bam \
1318+
-M 1 -S Coordinate
1319+
"""
1320+
}
1321+
1322+
// ################# END OF UMI READS PRE-PROCESSING
1323+
// from this moment on the generated uBam files can feed into the existing tools
11821324

11831325
// STEP 1: MAPPING READS TO REFERENCE GENOME WITH BWA MEM
11841326

1185-
if (params.trim_fastq) inputPairReads = outputPairReadsTrimGalore
1186-
else inputPairReads = inputPairReads.mix(inputBam)
1327+
if (params.umi){
1328+
(inputPairReads, input_pair_reads_sentieon) = consensus_bam_ch.into(2)
11871329

1188-
inputPairReads = inputPairReads.dump(tag:'INPUT')
1330+
inputPairReads = inputPairReads.dump(tag:'INPUT')
11891331

1190-
(inputPairReads, input_pair_reads_sentieon) = inputPairReads.into(2)
1191-
if (params.sentieon) inputPairReads.close()
1192-
else input_pair_reads_sentieon.close()
1332+
if (params.sentieon) inputPairReads.close()
1333+
else input_pair_reads_sentieon.close()
1334+
}
1335+
else {
1336+
if (params.trim_fastq) inputPairReads = outputPairReadsTrimGalore
1337+
else inputPairReads = inputPairReads.mix(inputBam)
1338+
inputPairReads = inputPairReads.dump(tag:'INPUT')
1339+
1340+
(inputPairReads, input_pair_reads_sentieon) = inputPairReads.into(2)
1341+
if (params.sentieon) inputPairReads.close()
1342+
else input_pair_reads_sentieon.close()
1343+
}
11931344

11941345
process MapReads {
11951346
label 'cpus_max'
@@ -2506,7 +2657,7 @@ process MergeMutect2Stats {
25062657

25072658
when: 'mutect2' in tools
25082659

2509-
script:
2660+
script:
25102661
stats = statsFiles.collect{ "-stats ${it} " }.join(' ')
25112662
"""
25122663
gatk --java-options "-Xmx${task.memory.toGiga()}g" \
@@ -2670,7 +2821,7 @@ process CalculateContamination {
26702821

26712822
input:
26722823
set idPatient, idSampleNormal, idSampleTumor, file(bamNormal), file(baiNormal), file(bamTumor), file(baiTumor), file(mergedPileup) from pairBamCalculateContamination
2673-
2824+
26742825
output:
26752826
set idPatient, val("${idSampleTumor}_vs_${idSampleNormal}"), file("${idSampleTumor}_contamination.table") into contaminationTable
26762827

0 commit comments

Comments
 (0)