Experiments to support upcoming research papers about binary segmentation.
Finite Sample Complexity Analysis of Binary Segmentation
Toby Dylan Hocking
Binary segmentation is the classic greedy algorithm which recursively
splits a sequential data set by optimizing some loss or likelihood
function. Binary segmentation is widely used for change-point
detection in data sets measured over space or time, and as a
sub-routine for decision tree learning. In theory, using a simple
loss function like Gaussian or Poisson, its asymptotic time complexity
should be extremely fast: for N data and S segments, O(N S) time in
the worst case, and O(N log S) time in the best case.
Paper: Comparing binsegRcpp with other implementations of binary segmentation for change-point detection
Abstract: Binary segmentation is a classic algorithm for detecting change-points in sequential data. In theory, using a simple loss function like Normal or Poisson, binary segmentation should be extremely fast for N data and S segments: asymptotically O(N S) time in the worst case, and O(N log S) time in the best case. In practice, existing implementations can be asymptotically slower, and can return incorrect results. We propose binsegRcpp, an R package which provides a correct C++ implementation, with the expected asymptotic time complexity. We discuss several important C++ coding techniques, and include detailed comparisons with other implementations of binary segmentation: ruptures, fpop, changepoint, blockcpd, and wbs.
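For orientation, here is a minimal usage sketch of the package the papers describe; it assumes the binseg_normal/coef interface documented in binsegRcpp.

```r
## Minimal usage sketch, assuming the binseg_normal/coef interface
## documented in the binsegRcpp package.
library(binsegRcpp)
x <- c(0.1, 0, 1, 1.1, 0.1, 0)                 # tiny data with two obvious changes
models <- binseg_normal(x, max.segments = 3L)  # greedy binary segmentation
print(models)                                  # loss for 1,2,3 segments
coef(models, 2L:3L)                            # segment starts/ends/means
```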
Replication materials: TGZ with data and R script for JSS.
- Figure 1 (models from 2 to 4 segments) and manuscript code/output: source article.Rnw, PDF.
- Figure 2 (DNA copy number example): R code, PNG.
- Figure 3 (cross-validation): R code, PDF.
- Figures 4-6 (list vs multiset, changepoint vs ruptures, throughput): R timing code, R figure code, Figure 4 PNG, Figure 4 tex, Figure 5 PNG, Figure 5 tex, Figure 6 PNG.
- Figure 7 (square loss): R timing code, R figure code, PNG.
- Figure 8 (normal change in mean and variance): R timing code, R figure code, PNG.
- Figure 9 (Poisson): R timing code, R figure code, PNG.
- Figure 10 (L1): R timing code, R figure code, PNG.
- Figure 11 (compare distributions): R figure code, PNG.
To reproduce the analyses and figures in the paper, you can use the code below:
- Figure 1: synthetic data which require tie-breaking. R code, PNG figure.
- Figure 2: optimal binary trees. R code, PNG figure.
- Figure 3: number of candidate splits considered in real genomic ChIP-seq data from McGill benchmark. R code 1 makes CSV file, R code 2 makes PNG figure.
- Figure 4: number of candidate splits considered in real genomic copy number data from neuroblastoma benchmark. R code 1 makes CSV file, R code 2 makes PNG figure.
- Compare with Julia https://github.com/STOR-i/Changepoints.jl/blob/master/src/BS.jl maybe using https://cran.r-project.org/web/packages/JuliaCall/readme/README.html (see the JuliaCall sketch after this list)
- MATLAB https://www.mathworks.com/help/signal/ref/findchangepts.html#mw_748d5529-e05a-4b55-a3fe-2a12a5772d22
- NAG licensing and installation too complicated, https://www.nag.com/numeric/nl/nagdoc_latest/clhtml/g13/g13ndc.html, https://www.nag.com/content/calling-nag-fortran-routines-r, https://www.nag.com/content/nagfwrappers-r-package
- Try ruptures cumsum deepcharles/ruptures#245 (comment)
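A hypothetical sketch of the Julia comparison via JuliaCall, mentioned in the first item above. The JuliaCall functions used here are real, but the Julia-side @BS expression is an assumption that must be checked against BS.jl in Changepoints.jl before relying on it.

```r
## Hypothetical sketch of calling Changepoints.jl from R via JuliaCall.
## The JuliaCall functions below exist; the Julia-side @BS expression
## is an assumption to be verified against BS.jl before use.
library(JuliaCall)
julia_setup()                                    # start embedded Julia
julia_install_package_if_needed("Changepoints")  # one-time install
julia_library("Changepoints")
set.seed(1)
julia_assign("data", rnorm(200))                 # copy R vector to Julia
## Assumed syntax: @BS macro with a Normal change-in-mean cost.
result <- julia_eval("@BS data Normal(:?, 1.0)")
```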
changepoint.bug.R makes figure for rkillick/changepoint#86
figure-compare-ruptures.R makes the figure below, which plots asymptotic reference lines on top of the empirical median timings (a sketch of the reference-line technique appears after the list below).
The figure above shows that:
- first column: changepoint is clearly cubic,
- second/third columns: ruptures is clearly quadratic in the worst case, and the best case seems to be somewhere between quadratic and log-linear.
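A minimal sketch of that reference-line technique: scale each candidate complexity function so it passes through the timing for the largest N, then plot everything on log-log axes. The column names and constants here are hypothetical stand-ins for the real timing data.

```r
## Sketch: asymptotic reference lines over empirical median timings.
## The timings table below is a stand-in for real measured data.
library(data.table)
library(ggplot2)
timings <- data.table(N = 2^(4:12))
timings[, seconds := 1e-7 * N^2]  # placeholder for real median timings
## Scale each candidate complexity curve to pass through the timing
## for the largest N, so curves and data share the same range.
refs <- timings[, {
  fun.list <- list("N log N" = N * log(N), "N^2" = N^2, "N^3" = N^3)
  rbindlist(lapply(names(fun.list), function(f) {
    curve <- fun.list[[f]]
    data.table(fun = f, N = N, seconds = curve * seconds[.N] / curve[.N])
  }))
}]
ggplot() +
  geom_line(aes(N, seconds, color = fun), data = refs) +
  geom_point(aes(N, seconds), data = timings) +
  scale_x_log10() +
  scale_y_log10()
```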
figure-timings-versions-data.R makes figure-timings-versions-data.csv
figure-timings-versions.R reads that and makes the figure below.
figure-synthetic.R makes the figure below.
figure-neuroblastoma-iterations-data.R makes figure-neuroblastoma-iterations-data.csv and figure-neuroblastoma-iterations-bounds.csv
figure-neuroblastoma-iterations.R reads those files and makes the figure below.
figure-mcgill-iterations-data.R makes figure-mcgill-iterations-data.csv
figure-mcgill-iterations.R reads that and makes the figure below.
The figure above shows that binary segmentation achieves the asymptotic best case complexity in real data.
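A minimal sketch of how such iteration counts can be compared to theoretical bounds, assuming the get_complexity function documented in binsegRcpp (small random data here instead of the real benchmark sequences).

```r
## Sketch: compare empirical candidate split counts to theoretical
## best/worst case bounds, assuming binsegRcpp::get_complexity
## as documented in the package.
library(binsegRcpp)
set.seed(1)
x <- rnorm(100)                                   # stand-in for a real sequence
models <- binseg("mean_norm", x, max.segments = 10L)
splits <- get_complexity(models)  # best/worst/empirical split counts
print(splits)
```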
figure-best-heuristics.R makes the figures below.
figure-optimal-trees.R makes small pictures for slides, like the one below, and the big picture below it.
figure-compare-distributions.R makes the figure below.
The figure above shows a small asymptotic slowdown for the Laplace (median) distribution.
figure-splits-loss.R makes a new figure which shows some empirical cumulative counts smaller than the best case; this is possible since the “best” case time complexity bound refers to the total number of splits at the end, not at each iteration.
figure-splits-loss.R also makes the figure below.
The figure above shows that a 0/1 sequence can be used for the worst case of the l1, mean_norm, and poisson losses, but for meanvar_norm we need a 0/1/10/11 sequence. Also, linear data achieve the best case for the normal losses, and are very close to the best case for l1 and poisson. TODO: figure out a simple synthetic data sequence which achieves the best case for l1 and poisson.
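For concreteness, a sketch of the synthetic sequences discussed above; it assumes the binseg(distribution, data.vec, max.segments) interface of binsegRcpp.

```r
## Sketch: synthetic worst/best case sequences discussed above,
## assuming the binseg(distribution, data.vec, max.segments) interface.
library(binsegRcpp)
N <- 64L
worst <- as.numeric(rep(c(0, 1), l = N))             # worst case: l1, mean_norm, poisson
worst.mv <- as.numeric(rep(c(0, 1, 10, 11), l = N))  # worst case: meanvar_norm
best <- as.numeric(seq(1, N))                        # best case: normal losses
m.worst <- binseg("mean_norm", worst, max.segments = N / 2L)
m.best <- binseg("mean_norm", best, max.segments = N / 2L)
```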
figure-timings-laplace-data.R makes figure-timings-laplace-data.csv
figure-timings-laplace.R reads that file and makes the figure below.
The figure above shows that ruptures looks asymptotically slower in the best case.
figure-timings-meanvar_norm-data.R makes figure-timings-meanvar_norm-data.csv
figure-timings-meanvar_norm.R reads that file and makes the figure below.
The figure above shows that:
- blockcpd is about the same speed as binsegRcpp multiset.
- for the worst case, changepoint is faster up to very large model sizes, but asymptotically slower.
figure-timings-poisson-data.R makes figure-timings-poisson-data.csv
figure-timings-poisson.R reads that file and makes the figure below.
The figure above shows that:
- blockcpd is about the same speed as binsegRcpp multiset.
- the other implementations are consistent with the results for the other losses.
TODO: compare both versions of blockcpd. Also compare with max.segs=n.data, since that is what blockcpd does?
figure-neuroblastoma.R makes the figure below, which shows a real data set for which there are differences between binsegRcpp and ruptures/changepoint.
ruptures_bug.py and changepoint.bug.R were used to report issues deepcharles/ruptures#242 and rkillick/changepoint#69
figure-timings-data.R makes figure-timings-data.csv
figure-timings.R reads that and makes the figure below.
The figure above was created using synthetic data which achieve the best/worst case time complexity of the binary segmentation algorithm. For each data set of a given size N in {2^2=4, 8, 16, 32, …, 2^20=1,048,576}, we run binary segmentation up to a max of N/2 segments (and do not go to a larger N if the algo/case resulted in a time greater than 100 seconds). The timings suggest that the changepoint R package uses a cubic algorithm (three nested for loops), whereas binsegRcpp uses an algorithm which is log-linear in the best case and quadratic in the worst case. The ruptures Python module seems to be asymptotically faster than changepoint but slower than binsegRcpp, maybe quadratic?
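A sketch of that timing protocol, assuming binsegRcpp::binseg_normal; the real timing scripts also cover the other implementations and use worst/best case data for each loss.

```r
## Sketch: time binary segmentation on doubling data sizes, stopping
## before going to a larger N once a run exceeds the time limit.
library(binsegRcpp)
seconds.limit <- 100
timing.list <- list()
for (e in 2:20) {                            # N in 2^2, ..., 2^20
  N <- 2^e
  x <- as.numeric(rep(c(0, 1), l = N))       # worst-case 0/1 data
  seconds <- system.time({
    binseg_normal(x, max.segments = N / 2)   # up to N/2 segments
  })[["elapsed"]]
  timing.list[[paste(N)]] <- data.frame(N, seconds)
  if (seconds > seconds.limit) break         # do not go to larger N
}
timing.df <- do.call(rbind, timing.list)
```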
The figure above shows that the loss for binsegRcpp is always less than the loss for the others, suggesting that there are bugs in the other implementations.
figure-select-segments-data.R computes simulations using a variety of model selection criteria, saving results to figure-select-segments-data.csv
figure-select-segments.R reads that result CSV file and makes the figure below.
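A sketch of one such criterion (BIC-style); it assumes the splits table (with a loss column) returned by binsegRcpp, and the criteria actually simulated are defined in figure-select-segments-data.R.

```r
## Sketch: BIC-style selection of the number of segments, assuming
## the splits table (with a loss column) returned by binsegRcpp.
library(binsegRcpp)
set.seed(1)
x <- c(rnorm(50, mean = 0), rnorm(50, mean = 3))  # one true change-point
models <- binseg_normal(x, max.segments = 10L)
loss.vec <- models$splits$loss       # total square loss for 1..10 segments
n <- length(x)
k <- seq_along(loss.vec)             # number of segments in each model
bic <- n * log(loss.vec / n) + k * log(n)  # BIC-style criterion
which.min(bic)                       # selected number of segments
```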