This repository contains the source code for our papers describing log-linear time algorithms for solving the PeakSeg constrained changepoint detection problem.
arXiv:1703.03352 A log-linear segmentation algorithm for peak detection in genomic data
Updated pre-print: jmlr-paper.pdf, jmlr-supplementary.pdf
The labeled data we used to compute peak detection accuracy are available at http://members.cbio.mines-paristech.fr/~thocking/chip-seq-chunk-db/
The fold IDs that we used in our four-fold cross-validation experiment to compute test AUC/accuracy are listed in http://members.cbio.mines-paristech.fr/~thocking/chip-seq-chunk-db/4foldcv-test-folds.csv
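For example, the fold assignments can be read directly into R; a minimal sketch, assuming only that the URL above serves a plain CSV file (the column names are whatever the file defines):
library(data.table)
## download and parse the 4-fold cross-validation test fold assignments.
folds <- fread(
  "http://members.cbio.mines-paristech.fr/~thocking/chip-seq-chunk-db/4foldcv-test-folds.csv")
str(folds)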
To recompute the figures from the arXiv paper, execute the code in the following R scripts.
- LaTeX source: arxiv-paper.tex
- Figure 1: figure-compare-unconstrained.R
- Figure 2: figure-2-min-envelope.R
- Figure 3: figure-PDPA-intervals.R and figure-PDPA-timings.R, make figure-PDPA-intervals.png and make figure-PDPA-timings.pdf
- Figure 4: figure-test-error-dots.R, make figure-test-error-dots.pdf
To recompute other figures, note that the Makefile shows how to re-make some intermediate RData files based on the ChIP-seq benchmark data.
arXiv:1810.00117 Generalized Functional Pruning Optimal Partitioning (GFPOP) for Constrained Changepoint Detection in Genomic Data
The benchmark data we used in this paper are much larger than in the previous paper. To re-compile the figures and the paper, type “make jss-paper.pdf”.
- Sweave/LaTeX source: jss-paper.Rnw
- Figures 3 (concave G function we maximize to find the model with P peaks) and 12 (time complexity of computing sqrt N peaks): jss-figure-evaluations.R
- Figure 4 (PeakSegFPOP can be used to compute more likely models than the default MACS2 model): jss-figure-more-likely-models.R
- Figure 5 (up-down constrained changepoint model is robust to spatial correlation): jss-figure-spatial-correlation.R
- Figure 6 (number of intervals is log N and time/space complexity is N log N): jss-figure-target-intervals-models.R
- Figure 7 (disk-based storage is a constant factor slower than memory-based storage, about 2x): jss-figure-disk-memory-compare-speed.R
- Figure 8 (number of DP iterations using OP is much lower than SN to compute a large number of peaks): jss-figure-variable-peaks.R
- Figure 9 (number of GFPOP calls depends on maximum number of peaks): jss-figure-more-evals.R
- Figure 10 (label error): jss-figure-label-error.R
- Figure 11 (N genomic data have sqrt N peaks): jss-figure-data-peaks.R
- added figure files: figure-good-bad.R
- new slides showing specific examples of better peak detection: figure-min-train-error.R
- big model labels in figure-test-error-dots.R
2022-06-20-paris-time-complexity.tex makes 2022-06-20-paris-time-complexity.pdf, slides with a summary of results from three recent papers, for a talk in Paris.
Updated: 2022-09-28-tucson-slides.tex makes 2022-09-28-tucson-slides.pdf
Title: Time complexity analysis of recently proposed algorithms for optimal changepoint detection
Abstract: Several dynamic programming algorithms have been recently proposed for efficiently solving various problems related to optimal changepoint detection in large data sequences measured over space or time. In this talk we will discuss three recent papers which each provide a theoretical or empirical time complexity analysis. Our time complexity analysis shows that these linear and log-linear time algorithms can be used for analysis of huge data sequences, with ten million observations or more, which are now common in areas such as genomics.
jss-figure-more-evals.R creates figures for 2019-useR-slides.Rnw, for the useR conf.
jss.more.evals.R attempts to answer the question from a reviewer: does the number of GFPOP evals depend on the max number of peaks?
- new scheme for PDPA.infeasible.R: for background segments with the same mean either before or after, join the peaks before and after (a sketch of this rule appears after this list). Before, this file was using the small peaks in these infeasible models. Contrast with PDPA.peaks.R, which simply ignores infeasible models.
- new file PDPA.infeasible.error.compare.R, which compares the label error of the two PDPA models with the CDPA model. Theoretically, the PDPA.infeasible model with the new join scheme should have more true positives and false positives than the PDPA.peaks model, which simply ignores infeasible models. This file also shows that the join scheme reduces the min train error to be similar to that of the CDPA.
- PDPA.timings.small.R runs the PeakSegPDPA algorithm for each data set in the McGill ChIP-seq benchmark, and only saves info for the model up to the last data point. In contrast, the original PDPA.timings.R creates much larger PDPA.model.RData files, since it stores cost/data/intervals for all data points.
- PDPA.targets.R computes target intervals for the PeakSegPDPA models, and problem.features.R computes features.
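A minimal sketch of the join rule described in the PDPA.infeasible.R item above, using data.table. This is an illustration only, not the actual code: the function name joinInfeasiblePeaks and the columns chromStart, chromEnd, mean, status are assumptions. A background segment whose mean equals that of an adjacent segment marks an active equality constraint (an infeasible model), so the peaks on either side of it are joined into one peak.
library(data.table)
joinInfeasiblePeaks <- function(segs){
  ## segs: one row per segment, ordered along the genome, with columns
  ## chromStart, chromEnd, mean, status ("background" or "peak").
  segs <- data.table(segs)[order(chromStart)]
  segs[, same.before := c(FALSE, diff(mean) == 0)]
  segs[, same.after := c(diff(mean) == 0, FALSE)]
  ## background segments to absorb into the surrounding peaks:
  segs[, absorb := status == "background" & (same.before | same.after)]
  ## consecutive rows that are peaks or absorbed backgrounds form one peak.
  segs[, peak.id := rleid(status == "peak" | absorb)]
  segs[status == "peak" | absorb, .(
    chromStart = min(chromStart),
    chromEnd = max(chromEnd)
  ), by = peak.id]
}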
Ideas about how to define a state graph for a solver of the Optimal Partitioning problem with affine constraints between adjacent segment means.
library(data.table)
### Define an affine constraint function
### g(from, to) = from.coef*from + to.coef*to + constant
### to be used as g(from, to) <= 0,
### e.g. from <= to is coded as 1*from -1*to + 0 <= 0.
affine.constraint <- function(from.coef, to.coef, constant){
data.table(from.coef, to.coef, constant)
}
no.constraint <- affine.constraint(0, 0, 0)
non.increasing <- affine.constraint(-1, 1, 0)
non.decreasing <- affine.constraint(1, -1, 0)
state <- function(state.name){
structure(data.table(state.name), type="state")
}
change <- function(from, to, constraint, penalty=NA){
structure(data.table(from, to, penalty, constraint), type="change")
}
loss <- function(loss.name){
structure(data.table(loss.name), type="loss")
}
start <- function(...){
structure(data.table(state.name=c(...)), type="start")
}
end <- function(...){
structure(data.table(state.name=c(...)), type="end")
}
unconstrained <- list(
state("anything"),
change("anything", "anything", no.constraint))
unconstrained.Gaussian <- c(unconstrained, list(
loss("Gaussian")))
unconstrained.Poisson <- c(unconstrained, list(
loss("Poisson")))
PeakSegFPOP <- list(
loss("Poisson"),
state("peak"),
state("background"),
start("background"),
change("background", "peak", non.decreasing, penalty=0),
change("peak", "background", non.increasing),
end("background"))
## variant of PeakSegFPOP with no start/end restriction, so the model may
## start or end in the peak state; the down change here has penalty=0.
PeakSegFPOP.start.or.end.up <- list(
  loss("Poisson"),
  state("peak"),
  state("background"),
  change("background", "peak", non.decreasing),
  change("peak", "background", non.increasing, penalty=0))
reduced.isotonic.regression <- list(
state("anything"),
change("anything", "anything", non.decreasing))
unimodal.regression <- list(
state("can.change.up.or.down"),
state("can.change.down"),
change("can.change.up.or.down", "can.change.up.or.down", non.decreasing),
change("can.change.up.or.down", "can.change.down", non.increasing),
change("can.change.down", "can.change.down", non.increasing))
unimodal.at.least.one.up <- c(unimodal.regression, list(
state("start"),
start("start"),
change("start", "can.change.up.or.down", non.decreasing)))
unimodal.at.least.one.up.and.down <- c(unimodal.at.least.one.up, list(
end("can.change.down")))
checkModel <- function(model.list){
type.vec <- sapply(model.list, attr, "type")
model.info <- sapply(unique(type.vec), function(type){
do.call(rbind, model.list[type.vec==type])
})
## TODO error checking.
model.info
}
checkModel(unimodal.at.least.one.up.and.down)
checkModel(PeakSegFPOP)
## TODO functions for plotting, solving, e.g. an interface like
## GFPOP(model, data.vec, weight.vec, penalty)
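As a possible starting point for the plotting TODO above, here is a rough sketch using only base graphics; the function name plotModel and the circular layout are assumptions, not a planned interface. Each state becomes a labeled point on a circle and each allowed change becomes an arrow.
plotModel <- function(model.list){
  type.vec <- sapply(model.list, attr, "type")
  state.dt <- do.call(rbind, model.list[type.vec == "state"])
  change.dt <- do.call(rbind, model.list[type.vec == "change"])
  ## place the states evenly around a circle.
  angle <- seq(0, 2*pi, length.out = nrow(state.dt) + 1)[-1]
  xy <- data.table(
    state.name = state.dt$state.name, x = cos(angle), y = sin(angle))
  plot(0, 0, type = "n", xlim = c(-1.5, 1.5), ylim = c(-1.5, 1.5),
    axes = FALSE, xlab = "", ylab = "", asp = 1)
  ## one arrow per allowed change; self-loop changes give zero-length
  ## arrows, which arrows() skips with a warning, suppressed here.
  from.i <- match(change.dt$from, xy$state.name)
  to.i <- match(change.dt$to, xy$state.name)
  suppressWarnings(arrows(
    xy$x[from.i], xy$y[from.i], xy$x[to.i], xy$y[to.i], length = 0.1))
  text(xy$x, xy$y, xy$state.name)
}
plotModel(PeakSegFPOP)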
Guillaume’s group meeting presentation slides http://members.cbio.mines-paristech.fr/~thocking/HOCKING-PeakSegFPOP-pipeline-slides.pdf
Data viz with smooth transitions, clarified titles.
Interactive data viz to explain supervised penalty learning for peaks.
Test accuracy and AUC data viz, which explains why Segmentor gets such a high test accuracy (it has low true positive and false positive rates): http://bl.ocks.org/tdhock/raw/886575874144c3b172ce6b7d7d770b9f/
- Slides for group meeting presentation 11 Aug 2016.
- http://bl.ocks.org/tdhock/raw/b796b4be10aa431575bb01ec16035b23/ shows the min envelope in addition to the min-less/min-more computation.
- C++ algo implemented in coseg package.
- figure-PeakSegPDPA-demo.R created http://bl.ocks.org/tdhock/raw/8c5dd0af533e24a893e7c5232f9bc94c/ using average loss instead of total loss.
figure-cDPA-PDPA-all.R visualizes the optimality and feasibility of the PDPA and cDPA models, and shows the interval counts in the PDPA http://bl.ocks.org/tdhock/raw/4582904f843cc60639fdfeb9651cac73/
figure-cDPA-PDPA.R shows the difference between the cDPA and PDPA on real data: the cDPA recovers a sub-optimal solution that obeys the strict inequality peak constraint, and the PDPA recovers the optimal solution for the non-strict inequality peak constraint. http://bl.ocks.org/tdhock/raw/24aa6387901edab1577ce24f1e736ff3/
- figure-constrained-PDPA-normal-real.R makes http://cbio.ensmp.fr/~thocking/figure-constrained-PDPA-normal-real/ a data viz which shows the constrained algorithm up to 5 segments for a data set with 121 points.
- figure-constrained-PDPA-normal-panels.R implements the constrained PDPA algo with two kinds of min-less/min-more operators, inspired by two kinds of inequality constraints (strict and non-strict); a sketch of these operators on a discretized grid appears after this list. Visualization of running the algos up to 3 segments on 5 data sets with 4 data points each: http://bl.ocks.org/tdhock/raw/e924d180dda5d0cd1da8e8f556e741b7/
- figure-unconstrained-PDPA-normal.R implements the unconstrained PDPA and visualizes the functional cost model and pruning http://cbio.ensmp.fr/~thocking/figure-unconstrained-PDPA-normal-big/
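On a discretized grid of candidate segment means, the min-less and min-more operators mentioned above reduce to cumulative minima. This is only an illustration of the definitions; the scripts above compute these operators exactly on piecewise functional representations of the cost.
## C(mu): cost as a function of the last segment mean mu, evaluated on a grid.
mean.grid <- seq(0, 10, by = 0.1)
cost.vec <- (mean.grid - 3)^2
## min-less operator: MinLess(mu) = min over mu' <= mu of C(mu').
min.less.vec <- cummin(cost.vec)
## min-more operator: MinMore(mu) = min over mu' >= mu of C(mu').
min.more.vec <- rev(cummin(rev(cost.vec)))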
