This repository contains the source code for our papers describing log-linear time algorithms for solving the PeakSeg constrained changepoint detection problem.
arXiv:1703.03352 A log-linear segmentation algorithm for peak detection in genomic data
Updated pre-print: jmlr-paper.pdf, jmlr-supplementary.pdf
The labeled data we used to compute peak detection accuracy are available at http://members.cbio.mines-paristech.fr/~thocking/chip-seq-chunk-db/
The fold IDs that we used in our four-fold cross-validation experiment to compute test AUC/accuracy are listed in http://members.cbio.mines-paristech.fr/~thocking/chip-seq-chunk-db/4foldcv-test-folds.csv
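For example, the fold assignments can be read directly into R; a minimal sketch, assuming only that the URL above serves a plain CSV file (the column names are whatever the file defines):
library(data.table)
## download and parse the 4-fold cross-validation test fold assignments.
folds <- fread(
  "http://members.cbio.mines-paristech.fr/~thocking/chip-seq-chunk-db/4foldcv-test-folds.csv")
str(folds)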
To recompute the figures from the arXiv paper, execute the code in the following R scripts.
- LaTeX source: arxiv-paper.tex
- Figure 1: figure-compare-unconstrained.R
- Figure 2: figure-2-min-envelope.R
- Figure 3: figure-PDPA-intervals.R and figure-PDPA-timings.R, make figure-PDPA-intervals.png and make figure-PDPA-timings.pdf
- Figure 4: figure-test-error-dots.R, make figure-test-error-dots.pdf
To recompute other figures, note that the Makefile shows how to re-make some intermediate RData files based on the ChIP-seq benchmark data.
arXiv:1810.00117 Generalized Functional Pruning Optimal Partitioning (GFPOP) for Constrained Changepoint Detection in Genomic Data
The benchmark data we used in this paper are much larger than in the previous paper. To re-compile the figures and the paper, type “make jss-paper.pdf”.
- Sweave/LaTeX source: jss-paper.Rnw
- Figures 3 (concave G function we maximize to find the model with P peaks) and 12 (time complexity of computing sqrt N peaks): jss-figure-evaluations.R
- Figure 4 (PeakSegFPOP can be used to compute more likely models than the default MACS2 model): jss-figure-more-likely-models.R
- Figure 5 (up-down constrained changepoint model is robust to spatial correlation): jss-figure-spatial-correlation.R
- Figure 6 (number of intervals is log N and time/space complexity is N log N): jss-figure-target-intervals-models.R
- Figure 7 (disk-based storage is a constant factor slower than memory-based storage, about 2x): jss-figure-disk-memory-compare-speed.R
- Figure 8 (number of DP iterations using OP is much lower than SN to compute a large number of peaks): jss-figure-variable-peaks.R
- Figure 9 (number of GFPOP calls depends on maximum number of peaks): jss-figure-more-evals.R
- Figure 10 (label error): jss-figure-label-error.R
- Figure 11 (N genomic data have sqrt N peaks): jss-figure-data-peaks.R
- added figure files: figure-good-bad.R
- new slides showing specific examples of better peak detection: figure-min-train-error.R
- big model labels in figure-test-error-dots.R
2022-06-20-paris-time-complexity.tex makes 2022-06-20-paris-time-complexity.pdf, slides with a summary of results from three recent papers, for a talk in Paris.
Updated: 2022-09-28-tucson-slides.tex makes 2022-09-28-tucson-slides.pdf
Title: Time complexity analysis of recently proposed algorithms for optimal changepoint detection
Abstract: Several dynamic programming algorithms have been recently proposed for efficiently solving various problems related to optimal changepoint detection in large data sequences measured over space or time. In this talk we will discuss three recent papers which each provide a theoretical or empirical time complexity analysis. Our time complexity analysis shows that these linear and log-linear time algorithms can be used for analysis of huge data sequences, with ten million observations or more, which are now common in areas such as genomics.
jss-figure-more-evals.R creates figures for 2019-useR-slides.Rnw, for the useR conf.
jss.more.evals.R attempts to answer the question from a reviewer: does the number of GFPOP evals depend on the max number of peaks?
- new scheme for PDPA.infeasible.R: for background segments with the same mean either before or after, join the peaks before and after (a sketch of this rule appears after this list). Before, this file was using the small peaks in these infeasible models. Contrast with PDPA.peaks.R, which simply ignores infeasible models.
- new file PDPA.infeasible.error.compare.R, which compares the label error of the two PDPA models with the CDPA model. Theoretically, the PDPA.infeasible model with the new join scheme should have more true positives and false positives than the PDPA.peaks model, which simply ignores infeasible models. This file also shows that the join scheme reduces the min train error to be similar to that of the CDPA.
- PDPA.timings.small.R runs the PeakSegPDPA algorithm for each data set in the McGill ChIP-seq benchmark, and only saves info for the model up to the last data point. In contrast, the original PDPA.timings.R creates much larger PDPA.model.RData files, since it stores cost/data/intervals for all data points.
- PDPA.targets.R computes target intervals for the PeakSegPDPA models, and problem.features.R computes features.
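A minimal sketch of the join rule described in the PDPA.infeasible.R item above, using data.table. This is an illustration only, not the actual code: the function name joinInfeasiblePeaks and the columns chromStart, chromEnd, mean, status are assumptions. A background segment whose mean equals that of an adjacent segment marks an active equality constraint (an infeasible model), so the peaks on either side of it are joined into one peak.
library(data.table)
joinInfeasiblePeaks <- function(segs){
  ## segs: one row per segment, ordered along the genome, with columns
  ## chromStart, chromEnd, mean, status ("background" or "peak").
  segs <- data.table(segs)[order(chromStart)]
  segs[, same.before := c(FALSE, diff(mean) == 0)]
  segs[, same.after := c(diff(mean) == 0, FALSE)]
  ## background segments to absorb into the surrounding peaks:
  segs[, absorb := status == "background" & (same.before | same.after)]
  ## consecutive rows that are peaks or absorbed backgrounds form one peak.
  segs[, peak.id := rleid(status == "peak" | absorb)]
  segs[status == "peak" | absorb, .(
    chromStart = min(chromStart),
    chromEnd = max(chromEnd)
  ), by = peak.id]
}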
Ideas about how to define a state graph for a solver of the Optimal Partitioning problem with affine constraints between adjacent segment means.
library(data.table)
### Define an affine constraint function
### g(from, to) = from.coef*from + to.coef*to + constant
### to be used as g(from, to) <= 0,
### e.g. from <= to is coded as 1*from -1*to + 0 <= 0.
affine.constraint <- function(from.coef, to.coef, constant){
data.table(from.coef, to.coef, constant)
}
no.constraint <- affine.constraint(0, 0, 0)
non.increasing <- affine.constraint(-1, 1, 0)
non.decreasing <- affine.constraint(1, -1, 0)
state <- function(state.name){
structure(data.table(state.name), type="state")
}
change <- function(from, to, constraint, penalty=NA){
structure(data.table(from, to, penalty, constraint), type="change")
}
loss <- function(loss.name){
structure(data.table(loss.name), type="loss")
}
start <- function(...){
structure(data.table(state.name=c(...)), type="start")
}
end <- function(...){
structure(data.table(state.name=c(...)), type="end")
}
unconstrained <- list(
state("anything"),
change("anything", "anything", no.constraint))
unconstrained.Gaussian <- c(unconstrained, list(
loss("Gaussian")))
unconstrained.Poisson <- c(unconstrained, list(
loss("Poisson")))
PeakSegFPOP <- list(
loss("Poisson"),
state("peak"),
state("background"),
start("background"),
change("background", "peak", non.decreasing, penalty=0),
change("peak", "background", non.increasing),
end("background"))
## variant of PeakSegFPOP with no start/end restriction, so the model may
## start or end in the peak state; the down change here has penalty=0.
PeakSegFPOP.start.or.end.up <- list(
  loss("Poisson"),
  state("peak"),
  state("background"),
  change("background", "peak", non.decreasing),
  change("peak", "background", non.increasing, penalty=0))
reduced.isotonic.regression <- list(
state("anything"),
change("anything", "anything", non.decreasing))
unimodal.regression <- list(
state("can.change.up.or.down"),
state("can.change.down"),
change("can.change.up.or.down", "can.change.up.or.down", non.decreasing),
change("can.change.up.or.down", "can.change.down", non.increasing),
change("can.change.down", "can.change.down", non.increasing))
unimodal.at.least.one.up <- c(unimodal.regression, list(
state("start"),
start("start"),
change("start", "can.change.up.or.down", non.decreasing)))
unimodal.at.least.one.up.and.down <- c(unimodal.at.least.one.up, list(
end("can.change.down")))
checkModel <- function(model.list){
type.vec <- sapply(model.list, attr, "type")
model.info <- sapply(unique(type.vec), function(type){
do.call(rbind, model.list[type.vec==type])
})
## TODO error checking.
model.info
}
checkModel(unimodal.at.least.one.up.and.down)
checkModel(PeakSegFPOP)
## TODO functions for plotting, solving, e.g. an interface like
## GFPOP(model, data.vec, weight.vec, penalty)
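As a possible starting point for the plotting TODO above, here is a rough sketch using only base graphics; the function name plotModel and the circular layout are assumptions, not a planned interface. Each state becomes a labeled point on a circle and each allowed change becomes an arrow.
plotModel <- function(model.list){
  type.vec <- sapply(model.list, attr, "type")
  state.dt <- do.call(rbind, model.list[type.vec == "state"])
  change.dt <- do.call(rbind, model.list[type.vec == "change"])
  ## place the states evenly around a circle.
  angle <- seq(0, 2*pi, length.out = nrow(state.dt) + 1)[-1]
  xy <- data.table(
    state.name = state.dt$state.name, x = cos(angle), y = sin(angle))
  plot(0, 0, type = "n", xlim = c(-1.5, 1.5), ylim = c(-1.5, 1.5),
    axes = FALSE, xlab = "", ylab = "", asp = 1)
  ## one arrow per allowed change; self-loop changes give zero-length
  ## arrows, which arrows() skips with a warning, suppressed here.
  from.i <- match(change.dt$from, xy$state.name)
  to.i <- match(change.dt$to, xy$state.name)
  suppressWarnings(arrows(
    xy$x[from.i], xy$y[from.i], xy$x[to.i], xy$y[to.i], length = 0.1))
  text(xy$x, xy$y, xy$state.name)
}
plotModel(PeakSegFPOP)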
Guillaume’s group meeting presentation slides http://members.cbio.mines-paristech.fr/~thocking/HOCKING-PeakSegFPOP-pipeline-slides.pdf
Data viz with smooth transitions, clarified titles.
Interactive data viz to explain supervised penalty learning for peaks.
Test accuracy and AUC data viz, which explains why Segmentor gets such a high test accuracy (it has low true positive and false positive rates): http://bl.ocks.org/tdhock/raw/886575874144c3b172ce6b7d7d770b9f/
- Slides for group meeting presentation 11 Aug 2016.
- http://bl.ocks.org/tdhock/raw/b796b4be10aa431575bb01ec16035b23/ shows the min envelope in addition to the min-less/min-more computation.
- C++ algo implemented in coseg package.
- figure-PeakSegPDPA-demo.R created http://bl.ocks.org/tdhock/raw/8c5dd0af533e24a893e7c5232f9bc94c/ using average loss instead of total loss.
figure-cDPA-PDPA-all.R visualizes the optimality and feasibility of the PDPA and cDPA models, and shows the interval counts in the PDPA http://bl.ocks.org/tdhock/raw/4582904f843cc60639fdfeb9651cac73/
figure-cDPA-PDPA.R shows the difference between the cDPA and PDPA on real data: the cDPA recovers a sub-optimal solution that obeys the strict inequality peak constraint, and the PDPA recovers the optimal solution for the non-strict inequality peak constraint. http://bl.ocks.org/tdhock/raw/24aa6387901edab1577ce24f1e736ff3/
- figure-constrained-PDPA-normal-real.R makes http://cbio.ensmp.fr/~thocking/figure-constrained-PDPA-normal-real/ a data viz which shows the constrained algorithm up to 5 segments for a data set with 121 points.
- figure-constrained-PDPA-normal-panels.R implements the constrained PDPA algo with two kinds of min-less/min-more operators, inspired by two kinds of inequality constraints (strict and non-strict); a sketch of these operators on a discretized grid appears after this list. Visualization of running the algos up to 3 segments on 5 data sets with 4 data points each: http://bl.ocks.org/tdhock/raw/e924d180dda5d0cd1da8e8f556e741b7/
- figure-unconstrained-PDPA-normal.R implements the unconstrained PDPA and visualizes the functional cost model and pruning http://cbio.ensmp.fr/~thocking/figure-unconstrained-PDPA-normal-big/
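On a discretized grid of candidate segment means, the min-less and min-more operators mentioned above reduce to cumulative minima. This is only an illustration of the definitions; the scripts above compute these operators exactly on piecewise functional representations of the cost.
## C(mu): cost as a function of the last segment mean mu, evaluated on a grid.
mean.grid <- seq(0, 10, by = 0.1)
cost.vec <- (mean.grid - 3)^2
## min-less operator: MinLess(mu) = min over mu' <= mu of C(mu').
min.less.vec <- cummin(cost.vec)
## min-more operator: MinMore(mu) = min over mu' >= mu of C(mu').
min.more.vec <- rev(cummin(rev(cost.vec)))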
