A segmentation approach to analyze DNA methylation patterns and identify differentially methylation regions from whole-genome datasets.
MethyLasso models DNA methylation data using a nonparametric regression framework known as a Generalized Additive Model. It relies on the fused lasso method to segment the genome by estimating regions in which methylation is constant to study DNA methylation patterns in a single condition or capture dynamic changes of DNA methylation levels across conditions.
MethyLasso identifies low-methylated regions (LMRs), unmethylated regions (UMRs), DNA methylation valleys (DMVs) and partially methylated domains (PMDs) in a single condition as well as differentially methylated regions (DMRs) between two conditions.
By Delphine Balaramane, Yannick G Spill, Michaël Weber & Anaïs F Bardet.
Publication: Nucleic Acid Research 2024 https://doi.org/10.1093/nar/gkae880
Dependencies:
- Requires R 3.6 or greater
- Requires R packages: Rcpp, RcppEigen, methods,scales, Matrix, matrixStats, data.table, ggplot2, foreach, doParallel, stringr, R.utils
- Requires conda (if you do the conda installation)
Conda installation:
unzip methylasso-main.zip
conda env create -f methylasso-main/conda_env.yml
conda activate MethyLasso
R CMD INSTALL --preclean methylasso-main
Installation without conda:
unzip methylasso-main.zip
R CMD INSTALL --preclean methylasso-main
Using data from two different conditions:
Rscript MethyLasso.R --n1 [name1] --c1 [condition1_replicate1,condition1_replicate2,..] --n2 [name2] --c2 [condition2_replicate1,condition2_replicate2,..] [options]
Using data from only one condition (not calling DMRs):
Rscript MethyLasso.R --n1 [name1] --c1 [condition1_replicate1,condition1_replicate2,..] [options]
| argument | description |
|---|---|
| --n1 | Name of condition 1 (used in output file names) |
| --c1 | Input file(s) for condition 1 (separated by comma without space for replicates) |
| --n2 | Name of condition 2 (used in output file names and corresponding to reference condition for DMR identification) |
| --c2 | Input file(s) for condition 2 (separated by comma without space for replicates) |
By Default, Methylasso takes a tab-delimited file containing for each CpG:
chr / start / end / percent_methylation / count_methylated / count_unmethylated
This file can be generated within the Bismark pipeline with the command Coverage2Cytosine with option --mergeCpG.
Alternatively, files with different formats can be provided by specifying the columns. For each CpG, their genomic coordinates should be in the first three columns (chr / start / end) and columns containing the counts of methylated and unmethylated Cs OR total count and methylation percentage should be specified.
| argument | description |
|---|---|
| --mC | Column number containing count of methylated Cs |
| --uC | Column number containing count of unmethylated Cs |
| OR | |
| --cov | Column number containing total count |
| --meth | Column number containing percentage of methylation [0-1] |
| argument | description |
|---|---|
| -c | Minimum coverage per CpG (default 5) |
| -n | Minimum number of CpGs in LMRs, UMRS, DMVs and DMRs (default 4) |
| -m | Maximum mean methylation for PMDs (default 0.7) |
| -d | Minimum methylation difference for DMRs [0-1] (default 0.1) |
| -p | P-value threshold for DMRs (default 0.05) |
| -q | Q-value (FDR) threshold for DMRs (to be used instead of p-value threshold) |
| -r | Replicate coverage score for CpGs in DMRs (default 0.7) |
| -s | Skip LMR, UMR, DMV and PMD identification (default FALSE) |
| -t | Number of threads to use (default 1) |
| -o | Output directory (default current directory) |
| -f | Output figures in pdf format (default TRUE) |
| --quiet | Do not print processing information (default FALSE) |
| --version | Print version |
| --help | Print help |
| Argument | Description |
|---|---|
| --umrlmr_lambda2 | Segmentation value for UMR/LMR identification (default 25) |
| --tol_val | Tolerance value for optimization (default 0.01) |
| --pmd_lambda2 | Segmentation value for PMD identification (default 1000) |
| --pmd_std_threshold | Standard deviation threshold for PMD (default 0.15) |
| --umr_std_threshold | Standard deviation threshold for UMR (default 0.1) |
| --umr_max_beta | Maximum methylation for UMR (default 0.1) |
| --lmr_max_beta | Maximum methylation for LMR (default 0.5) |
| --valley_max_beta | Maximum methylation for valleys (default 0.1) |
| --pmd_valley_min_width | Minimum width for PMD valleys (default 5000) |
| --max_distance | Maximum distance for merging segments (default 500) |
| --min_width | Minimum width for segments (default 30) |
| --umr_large_width | Large width threshold for UMRs (default 500) |
| --umr_large_density | Density threshold for large UMRs (default 0.03) |
| --dmr_lambda2 | Segmentation value for DMR identification (default 25) |
| --min.overlap.pc | Minimum overlap percentage for DMRs (default 10) |
| --flank.width | Minimum width of flanking segments (default 300) |
| --flank.dist.max | Maximum distance of UMR/LMR to flanking segment (default 5000) |
| --flank.num.cpgs | Consider as small UMR/LMRs those which have num.cpgs smaller than this threshold (default 10) |
| --flank.beta.min | For small UMR/LMRs, require that flanking segments have mean methylation above this threshold (default 0.5) |
| --split.pmds | Whether to split PMDs which contain UMRs (default TRUE) |
| --merge.pmds | Whether to merge adjacent PMDs into a single region (default TRUE) |
| --drop | Whether to drop verbose columns and unannotated segments (default TRUE) |
Two histograms are generated for each replicate in each condition in a pdf file name_replicate_plots.pdf containing: a distribution of the sequencing depth and distribution of the methylation of CpGs obtained after applying the minimum coverage per CpG (option -m).
For the LMR/UMR/DMV/PMD part, two tab-delimited files will be generated for each condition name_lmr_umr_dmv.tsv and name_pmd.tsv containing: chr / start / end / number of CpGs / methylation / standard deviation / category
After PMDs identification, a smooth scatter plot containing the standart deviation and mean methylation of each PMDs is generated in a pdf file for each condition name1_pmds_plot.pdf and name2_pmds_plot.pdf.
For the DMR part, one tab-delimited file will be generated name1_vs_name2_dmrs.tsv containing: chr / start / end / number of CpGs condition 1 / number of CpGs condition 2 / replicate coverage score / mean methylation 1 / mean methylation 2 / methylation difference / p-value / q-value / annotation If the first part is not run (option -s), the last annotation column will not be added.
After DMRs identification, a smooth scatter plot containing the mean methylation 1 and mean methylation 2 of each DMRs is generated in a pdf file name1_vs_name2_dmrs_plot.pdf.
An example of command line to run MethyLasso is provided using files from Hansen et al. 2011 comparing healthy and colon cancer conditions with 3 replicates for each (healthy_1, healthy_2, healthy_3, cancer_1, cancer_2, cancer_3). The input files contain for each CpG , their genomic coordinates in the first three columns (chr / start / end), the total Count of Cs in columns four (--cov 4) and the percentage of methylation in columns 5 (--meth 5). DMRs will be identified with a q-value threshold 0.05 (-q 0.05).
Rscript MethyLasso.R --n1 healthy --c1 WGBS_healthy_1_meth.bed.gz,WGBS_healthy_2_meth.bed.gz,WGBS_healthy_3_meth.bed.gz --n2 cancer --c2 WGBS_cancer_1_meth.bed.gz,WGBS_cancer_2_meth.bed.gz,WGBS_cancer_3_meth.bed.gz --cov 4 --meth 5 -q 0.05
Output files obtained by running this command are LMRs/UMRs/DMVs file healthy and cancer , PMDs file for healthy and cancer and DMRs file cancer vs healthy.
Also plots containing information on each input files are generated for healthy condition:

A plot on PMDs for each condition, where the user can see what threshold to take if he wants to recover more methylated PMDs and change the -m option.

And a plot on DMRs methylation in cancer and healthy conditions.



