1 Human Genetics Center, School of Public Health, The University of Texas Health Science Center at Houston, Houston, Texas 77030,
USA;
2 Center for Statistical Genetics and Department of Biostatistics, University of Michigan School of Public Health, Ann Arbor,
Michigan 48109, USA
Corresponding author: hmkang{at}umich.edu
Abstract
The analysis of next-generation sequencing data is computationally and statistically challenging because of the massive volume
of data and imperfect data quality. We present GotCloud, a pipeline for efficiently detecting and genotyping high-quality
variants from large-scale sequencing data. GotCloud automates sequence alignment, sample-level quality control, variant calling,
filtering of likely artifacts using machine-learning techniques, and genotype refinement using haplotype information. The
pipeline can process thousands of samples in parallel and requires fewer computational resources than current alternatives.
Experiments with whole-genome and exome-targeted sequence data generated by the 1000 Genomes Project show that the pipeline
provides effective filtering against false-positive variants and high power to detect true variants. Our pipeline has already
contributed to variant detection and genotyping in several large-scale sequencing projects, including the 1000 Genomes Project
and the NHLBI Exome Sequencing Project. We hope it will now prove useful to many medical sequencing studies.
Footnotes
[Supplemental material is available for this article.]
This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue
publication date (see https://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described
at https://creativecommons.org/licenses/by-nc/4.0/.