Please don’t use this code! If you want to use PeakSeg for genome-wide peak calling, please use PeakSegPipeline instead. (it implements the same algorithms, but with a better user interface; it should be easier to install and use)
This repository contains scripts for genome-wide supervised ChIP-seq peak prediction, for a single experiment type (e.g. broad H3K36me3 or sharp H3K4me3 data), jointly using multiple samples and cell types.
- Labeling. The first step of any supervised analysis is to label a few genomic regions with and without peaks in your data. The create_track_hub.R script creates a track hub from bigWig files, which makes it easy to visualize data on the UCSC genome browser and then create labels (specific genomic regions where you have observed presence or absence of peaks in specific samples). For more details about labeling see Input Files -> Label Data below.
- Single-sample peak calling using PeakSegFPOP. This repository includes PeakSegFPOP, a command line limited memory PeakSeg functional pruning optimal partitioning algorithm (arXiv:1703.03352), which is used to predict preliminary peaks for each sample independently.
- Multiple-sample peak calling using PeakSegJoint. After single-sample analysis, peaks are clustered into genomic regions with overlapping peaks. In each cluster we run PeakSegJoint to find the most likely common peak positions across multiple samples – this includes a prediction of which samples are up and down for each genomic region.
The following commands can be used to install all the required dependencies on Ubuntu precise (linux.x86_64 architecture). If you have another system, make sure to install R, BerkeleyDB STL, and UCSC tools (bigWigToBedGraph, bedToBigBed). Then execute packages.R to install all required R packages.
sudo aptitude install build-essential r-recommended libdb6.0++-dev libdb6.0-stl-dev bedtools
export BIN=~/bin # or any other directory on your $PATH
curl -L https://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/bigWigToBedGraph -o $BIN/bigWigToBedGraph
curl -L https://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/bedToBigBed -o $BIN/bedToBigBed
chmod u+x $BIN/bigWigToBedGraph $BIN/bedToBigBed
make
cp PeakSegFPOP $BIN
Rscript packages.R
BerkeleyDB is used by PeakSegFPOP to write a large temporary file to disk. If you don’t have root or aptitude, then you can always install BerkeleyDB by hand to your home directory
wget https://download.oracle.com/berkeley-db/db-6.2.23.NC.tar.gz
tar xf db-6.2.23.NC.tar.gz
cd db-6.2.23.NC/build_unix
../dist/configure --prefix=$HOME --enable-stl
make
make install
If BerkeleyDB is installed to your home directory or some other non-standard directory, then make sure to edit the Makefile to tell the compiler where to find it.
g++ -std=c++0x -o PeakSegFPOP PeakSegFPOPLog.cpp funPieceListLog.cpp -I$HOME/include -L$HOME/lib -ldb_stl -Wl,-rpath=$HOME/lib
Once everything has been installed, you can test your installation by
executing test_demo.R. It will first download some bigWigs and
labels to the test/demo
directory, then run pipeline.R on
them. If everything worked, you can view the results by opening
test/demo/index.html
in a web browser, and it should be the same as
the results shown on
https://cbio.mines-paristech.fr/~thocking/hubs/test/demo/
The pipeline.R script uses PeakSegFPOP + PeakSegJoint to predict common and different peaks in multiple samples. It requires three kinds of input data:
- coverage data under
project/samples
, - labels in
project/labels
, - genomic segmentation problems in
project/problems.bed
.
The last step of pipeline.R is plot_all.R which creates
index.html
a web page which summarizes the results,peaks_matrix.tsv
a binary matrix (peaks x samples) in which 1 means peak and 0 means no peak.peaks_summary.tsv
is a table with a row for each genomic region that has a peak in at least one sample. The columns arechrom
,peakStart
,peakEnd
genomic region of peak.specificity
if you have labeled peaks in Input samples, the model labels each peak as either specific (few Input samples up), or non-specific (many Input samples up). If you want to filter non-specific Input peaks yourself, you can use then.Input
column, which is the number of Input samples with a peak in this region.loss.diff
the likelihood of the peak (larger values mean taller and wider peaks in more samples).chisq.pvalue
,fisher.pvalue
P-Values from Chi-Squared (chisq.test
) and Fisher’s exact test (fisher.test
) for whether or not this peak is group-specific (lower values mean strong correlation between peak calls and groups).
To give a concrete example, let us consider the data set that is used when you run test_demo.R.
Each coverage data file should contain counts of aligned sequence reads at every genomic position, for one sample. These files can be in either bedGraph or bigWig format (bigWig is preferable since it is indexed and thus faster to read than bedGraph). For example test_demo.R downloads 8 files:
test/demo/samples/bcell/MS026601/coverage.bigWig test/demo/samples/bcell_/MS010302/coverage.bigWig test/demo/samples/Input/MS002201/coverage.bigWig test/demo/samples/Input/MS026601/coverage.bigWig test/demo/samples/Input_/MS002202/coverage.bigWig test/demo/samples/Input_/MS010302/coverage.bigWig test/demo/samples/kidney/MS002201/coverage.bigWig test/demo/samples/kidney_/MS002202/coverage.bigWig
In the example above we have the test/demo
directory which will
contain all data sets, labels, and peak calls for this particular
project. The samples
directory contains a sub-directory for each
sample group (experimental conditions or cell types, e.g. bcell
or
kidney
). Each sample group directory should contain a sub-directory
for each sample (e.g. MS002201
or MS010302
). Each sample
sub-directory should contain either a coverage.bedGraph
or
coverage.bigWig
file with counts of aligned sequence reads.
Note that in this demonstration project, the groups with underscores
are un-labeled samples (e.g. bcell_
), and the groups without
underscores are labeled samples (e.g. bcell
). In real projects
typically you would combine those two groups into a single labeled
group, but in this project we keep them separate in order to
demonstrate the prediction accuracy of the learning algorithm.
The project/labels/*.txt
files contain genomic regions with or without
peaks. These labels will be used to train the peak prediction models
(automatically select model parameters that yield optimal peak
prediction accuracy). A quick and easy way to create labels is by
visual inspection as in the McGill ChIP-seq peak detection benchmark
(for details please read Hocking et al, Bioinformatics 2016).
To visually label your data first create a project directory on a
webserver with project/samples/groupID/sampleID/coverage.bigWig
files, then create a track hub using a command such as
Rscript create_track_hub.R project https://your.server.com/~user/path- hg19 email@domain.com
The arguments of the create_track_hub.R
script are as follows:
- The first argument
project
is the data directory. - The second argument
https://your.server.com/~user/path-
is the URL prefix (appended before the first argument to obtain URLs for the trackDb.txt file). - The third argument
hg19
is the UCSC genome ID for the genomes.txt file. - The fourth argument
email@domain.com
is the email address for the hub.txt file.
If that command worked, then you should see a message Created
https://your.server.com/~user/path-project/hub.txt
and then you can
paste that URL into My Data -> Track Hubs -> My Hubs then click Add
Hub to tell the UCSC genome browser to display your data. Navigate
around the genome until you have found some peaks, then add positive
and negative labels in project/labels/*.txt
files.
For example in test_demo.R the data set contains only one labels file,
test/demo/labels/some_labels.txt
which contains lines such as the following
chr10:33,061,897-33,162,814 noPeaks chr10:33,456,000-33,484,755 peakStart kidney chr10:33,597,317-33,635,209 peakEnd kidney chr10:33,662,034-33,974,942 noPeaks chr10:35,182,820-35,261,001 noPeaks chr10:35,261,418-35,314,654 peakStart bcell kidney
A chunk is a group of nearby labels. In the example above there are two chunks (far apart genomic regions, separated by an empty line). The first chunk has two regions with noPeaks labels in all samples, and two regions with positive labels in kidney samples and noPeaks labels in bcell samples. The second chunk has one region with noPeaks in bcell and kidney samples, and one region with a peakStart label in bcell and kidney samples.
In general, the labels file is divided into separate chunks by empty lines. Each chunk should contain lines for several nearby genomic regions, the corresponding label (noPeaks, peakStart, peakEnd, peaks), and the sample groups to which that label should be assigned (all other groups mentioned in the labels file will receive the noPeaks label). Ideally, each chunk should contain
- At least one label with a peak in all samples.
- At least one label with no peaks in any samples.
- At least one label with a peak in some samples but not others (these labels are crucial for the model to be able to learn what is a significant difference between up and down).
Visualizing labels. After having added some labels in
project/labels/*.txt
files, run Rscript convert_labels.R project
to create project/all_labels.bed
. Then when you re-run Rscript
create_track_hub.R ...
it will create a new hub with a track
“Manually labeled regions with and without peaks” that displays the
labels you have created.
The last input file that you need to provide is a list of separate segmentation problems for your reference genome (regions without gaps). This file should be in BED format (e.g. hg19_problems.bed).
If you don’t use hg19, but you do use another standard genome that is hosted on UCSC, then you can use downloadProblems.R
Rscript downloadProblems.R hg38 hg38_problems.bed
If your reference genome does not exist on UCSC, you can use
gap2problems.R to make a problems.bed
file:
Rscript gap2problems.R yourGenome_gap.bed yourGenome_chromInfo.txt yourGenome_problems.bed
where the chromInfo file contains one line for every chromosome, and
the gap file contains one line for every gap in the reference (unknown
/ NNN sequence). If there are no gaps in your genome, then you can use
yourGenome_chromInfo.txt
as a problems.bed
file.
Since the human genome is so large, we recommend to do model training and peak prediction in parallel. To use a PBS/qsub cluster such as Compute Canada’s guillimin, begin by editing the create_problems_all.R script to reflect your cluster configuration. Then run
cd PeakSegFPOP
Rscript convert_labels.R test/demo
Rscript create_problems_all.R test/demo
That will create problem sub-directories in
test/demo/samples/*/*/problems/*
. Begin model training by computing
target.tsv
files:
for lbed in test/demo/samples/*/*/problems/*/labels.bed;do qsub $(echo $lbed|sed 's/labels.bed/target.tsv.sh/');done
The target is the largest interval of log(penalty) values for which
PeakSegFPOP returns peak models that have the minimum number of
incorrect labels. The target.tsv
files are used for training a
machine learning model that can predict optimal penalty values, even
for un-labeled samples and genome subsets. To train a model, use
Rscript train_model.R test/demo
which trains a model using
test/demo/samples/*/*/problems/*/target.tsv
files, and saves it to
test/demo/model.RData
. To compute peak predictions independently for
each sample and genomic segmentation problem,
for sh in test/demo/problems/*/jointProblems.bed.sh;do qsub $sh;done
which will launch one job for each genomic segmentation problem. Each
job will make peak predictions in all samples, then write
test/demo/problems/*/jointProblems/*
directories with
target.tsv.sh
and peaks.bed.sh
scripts. One directory and joint
segmentation problem will be created for each genomic region which has
at least one sample with a predicted peak. To train a joint peak
calling model, run
qsub test/demo/joint.model.RData.sh
which will compute test/demo/joint.model.RData
and
test/demo/jobs/*/jobProblems.bed
files. To make joint peak
predictions, run
for sh in test/demo/jobs/*/jobPeaks.sh;do qsub $sh;done
To gather all the peak predictions in a summary on
test/demo/index.html
, run
qsub test/demo/peaks_matrix.tsv.sh
Finally, you can create test/demo/hub.txt
which can be used as a
track hub on the UCSC genome browser:
Rscript create_track_hub.R test/demo https://hubs.hpc.mcgill.ca/~thocking/PeakSegFPOP- hg19 email@domain.com
The script will create
test/demo/samples/*/*/coverage.bigWig
and
test/demo/samples/*/*/joint_peaks.bigWig
files that will be shown
together on the track hub in a multiWig container (for each sample, a
colored coverage profile with superimposed peak calls as horizontal
black line segments).
The PeakSegFPOP program finds the peak positions and corresponding piecewise constant segment means which optimize the penalized Poisson likelihood.
PeakSegFPOP coverage.bedGraph penalty [tmp.db]
The first argument coverage.bedGraph
is a plain text file with 4
tab-separated columns: chrom, chromStart, chromEnd, coverage (chrom is
character and the others are integers). It should include data for
only one chromosome, and no gap regions.
The second argument penalty
is a non-negative penalty value, for
example 0, 0.1, 1e3, or Inf.
The third argument tmp.db
is optional. It is the path for a
temporary file which takes O(N log N) disk space (N = number of lines
in coverage.bedGraph). In practice you can expect the size of the
temporary file and the computation time to be as in the table
below. Min and max values show the variation over several values of
the penalty parameter (larger penalties require more time and disk
space), on an Intel(R) Core(TM) i7 CPU 930 @ 2.80GHz.
N | min(MB) | max(MB) | min(time) | max(time) |
---|---|---|---|---|
10000 | 12 | 43 | 1 sec | 2 sec |
100000 | 189 | 627 | 12 sec | 25 sec |
1000000 | 3462 | 7148 | 3 min | 5 min |
7135956 | 5042 | 41695 | 18 min | 56 min |
7806082 | 5270 | 33425 | 35 min | 167 min |
For a single run with penalty parameter X
, the PeakSegFPOP program
outputs two files. The coverage.bedGraph_penalty=X_segments.bed
file
has one line for each segment, and the following tab-separated
columns: chrom
, chromStart
, chromEnd
, segment.type
,
segment.mean
. The coverage.bedGraph_penalty=X_loss.tsv
has just
one line and the following tab-separated columns:
penalty
input penalty parameter.segments
number of segments in the optimal model.peaks
number of peaks in the optimal model.bases
number of bases in the bedGraph file.mean.pen.cost
mean penalized Poisson loss.total.cost
total un-penalized Poisson loss. The following equation should hold for all data sets and penalty parameters: (total.cost + penalty * peaks)/bases = mean.pen.coststatus
is the optimal model feasible for the PeakSeg problem with strict inequality constraints? If infeasible, then there is at least one pair of adjacent segment means which are equal (and there is no optimal solution to the problem with strict inequality constraints).mean.intervals
mean count of intervals (Poisson loss function pieces) over all the 2*N cost function models computed by the algorithm.max.intervals
maximum number of intervals.
An in-memory implementation of PeakSegFPOP is available in the coseg R package.
implementation | time | memory | disk |
---|---|---|---|
command line | O(N log N) | O(log N) | O(N log N) |
R pkg coseg | O(N log N) | O(N log N) | 0 |
Note that although both implementations are O(N log N) time complexity for N data points, the command line program is slower due to disk read/write overhead.