DECA: scalable XHMM exome copy-number variant calling with ADAM and Apache Spark

Abstract Background XHMM is a widely used tool for copy-number variant (CNV) discovery from whole exome sequencing data but can require hours to days to run for large cohorts. A more scalable implementation would reduce the need for specialized computational resources and enable increased exploration of the configuration parameter space to obtain the best possible results. Results DECA is a horizontally scalable implementation of the XHMM algorithm using the ADAM framework and Apache Spark that incorporates novel algorithmic optimizations to eliminate unneeded computation. DECA parallelizes XHMM on both multi-core shared memory computers and large shared-nothing Spark clusters. We performed CNV discovery from the read-depth matrix in 2535 exomes in 9.3 min on a 16-core workstation (35.3× speedup vs. XHMM), 12.7 min using 10 executor cores on a Spark cluster (18.8× speedup vs. XHMM), and 9.8 min using 32 executor cores on Amazon AWS’ Elastic MapReduce. We performed CNV discovery from the original BAM files in 292 min using 640 executor cores on a Spark cluster. Conclusions We describe DECA’s performance, our algorithmic and implementation enhancements to XHMM to obtain that performance, and our lessons learned porting a complex genome analysis application to ADAM and Spark. ADAM and Apache Spark are a performant and productive platform for implementing large-scale genome analyses, but efficiently utilizing large clusters can require algorithmic optimizations and careful attention to Spark’s configuration parameters.

Download Full-text

tHapMix: simulating tumour samples through haplotype mixtures

10.1101/057414 ◽

2016 ◽

Author(s):

Sergii Ivakhno ◽

Camilla Colombo ◽

Stephen Tanner ◽

Philip Tedder ◽

Stefano Berri ◽

...

Keyword(s):

Copy Number ◽

Large Scale ◽

Variant Calling ◽

Copy Number Variant ◽

Supplementary Information ◽

Genome Diversity ◽

Simulation Framework ◽

Somatic Genome ◽

Copy Number Changes ◽

Sequencing Platforms

AbstractMotivationLarge-scale rearrangements and copy number changes combined with different modes of cloevolution create extensive somatic genome diversity, making it difficult to develop versatile and scalable oriant calling tools and create well-calibrated benchmarks.ResultsWe developed a new simulation framework tHapMix that enables the creation of tumour samples with different ploidy, purity and polyclonality features. It easily scales to simulation of hundreds of somatic genomes, while re-use of real read data preserves noise and biases present in sequencing platforms. We further demonstrate tHapMix utility by creating a simulated set of 140 somatic genomes and showing how it can be used in training and testing of somatic copy number variant calling tools.Availability and implementationtHapMix is distributed under an open source license and can be downloaded from https://github.com/Illumina/[email protected] informationSupplementary data are available at Bioinformatics online.

Download Full-text

ERDS-Exome: A Hybrid Approach for Copy Number Variant Detection from Whole-Exome Sequencing Data

IEEE/ACM Transactions on Computational Biology and Bioinformatics ◽

10.1109/tcbb.2017.2758779 ◽

2020 ◽

Vol 17 (3) ◽

pp. 796-803 ◽

Cited By ~ 2

Author(s):

Renjie Tan ◽

Jixuan Wang ◽

Xiaoliang Wu ◽

Liran Juan ◽

Tianjiao Zhang ◽

...

Keyword(s):

Exome Sequencing ◽

Copy Number ◽

Hybrid Approach ◽

Copy Number Variant ◽

Sequencing Data ◽

Exome Sequencing Data ◽

Whole Exome ◽

Whole Exome Sequencing Data ◽

Variant Detection ◽

Copy Number Variant Detection

Download Full-text

ERDS-pe: A paired hidden Markov model for copy number variant detection from whole-exome sequencing data

2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) ◽

10.1109/bibm.2016.7822508 ◽

2016 ◽

Cited By ~ 2

Author(s):

Renjie Tan ◽

Jixuan Wang ◽

Xiaoliang Wu ◽

Guoqiang Wan ◽

Rongjie Wang ◽

...

Keyword(s):

Markov Model ◽

Copy Number ◽

Hidden Markov ◽

Copy Number Variant ◽

Sequencing Data ◽

Exome Sequencing Data ◽

Whole Exome ◽

Whole Exome Sequencing Data ◽

Variant Detection ◽

Copy Number Variant Detection

Download Full-text

HadoopCNV: A dynamic programming imputation algorithm to detect copy number variants from sequencing data

10.1101/124339 ◽

2017 ◽

Author(s):

Hui Yang ◽

Gary Chen ◽

Leandro Lima ◽

Han Fang ◽

Laura Jimenez ◽

...

Keyword(s):

Dynamic Programming ◽

Copy Number ◽

Large Scale ◽

Read Depth ◽

Copy Number Variations ◽

Detection Methods ◽

Depth Information ◽

Sequencing Data ◽

Cnv Detection ◽

Hadoop Framework

ABSTRACTBACKGROUNDWhole-genome sequencing (WGS) data may be used to identify copy number variations (CNVs). Existing CNV detection methods mostly rely on read depth or alignment characteristics (paired-end distance and split reads) to infer gains/losses, while neglecting allelic intensity ratios and cannot quantify copy numbers. Additionally, most CNV callers are not scalable to handle a large number of WGS samples.METHODSTo facilitate large-scale and rapid CNV detection from WGS data, we developed a Dynamic Programming Imputation (DPI) based algorithm called HadoopCNV, which infers copy number changes through both allelic frequency and read depth information. Our implementation is built on the Hadoop framework, enabling multiple compute nodes to work in parallel.RESULTSCompared to two widely used tools – CNVnator and LUMPY, HadoopCNV has similar or better performance on both simulated data sets and real data on the NA12878 individual. Additionally, analysis on a 10-member pedigree showed that HadoopCNV has a Mendelian precision that is similar or better than other tools. Furthermore, HadoopCNV can accurately infer loss of heterozygosity (LOH), while other tools cannot. HadoopCNV requires only 1.6 hours for a human genome with 30X coverage, on a 32-node cluster, with a linear relationship between speed improvement and the number of nodes. We further developed a method to combine HadoopCNV and LUMPY result, and demonstrated that the combination resulted in better performance than any individual tools.CONCLUSIONSThe combination of high-resolution, allele-specific read depth from WGS data and Hadoop framework can result in efficient and accurate detection of CNVs.

Download Full-text

Pervasive cis effects of variation in copy number of large tandem repeats on local epigenetics and gene expression

10.1101/2020.12.16.423078 ◽

2020 ◽

Author(s):

Paras Garg ◽

Alejandro Martin-Trujillo ◽

Oscar L. Rodriguez ◽

Scott J. Gies ◽

Bharati Jadhav ◽

...

Keyword(s):

Gene Expression ◽

Dna Methylation ◽

Copy Number ◽

Large Scale ◽

Tandem Repeats ◽

Local Variation ◽

Read Depth ◽

Variable Number ◽

Human Populations ◽

Functional Variation

ABSTRACTVariable Number Tandem Repeats (VNTRs) are composed of large tandemly repeated motifs, many of which are highly polymorphic in copy number. However, due to their large size and repetitive nature, they remain poorly studied. To investigate the regulatory potential of VNTRs, we used read-depth data from Illumina whole genome sequencing to perform association analysis between copy number of ~70,000 VNTRs (motif size ≥10bp) with both gene expression (404 samples in 48 tissues) and DNA methylation (235 samples in peripheral blood), identifying thousands of VNTRs that are associated with local gene expression (eVNTRs) and DNA methylation levels (mVNTRs). Using large-scale replication analysis in an independent cohort we validated 73-80% of signals observed in the two discovery cohorts, providing robust evidence to support that these represent genuine associations. Further, conditional analysis indicated that many eVNTRs and mVNTRs act as QTLs independently of other local variation. We also observed strong enrichments of eVNTRs and mVNTRs for regulatory features such as enhancers and promoters. Using the Human Genome Diversity Panel, we defined sets of VNTRs that show highly divergent copy numbers among human populations, show that these are enriched for regulatory effects on gene expression and epigenetics, and preferentially associate with genes that have been linked with human phenotypes through GWAS. Our study provides strong evidence supporting functional variation at thousands of VNTRs, and defines candidate sets of VNTRs, copy number variation of which potentially plays a role in numerous human phenotypes.

Download Full-text

CNV-P: a machine-learning framework for predicting high confident copy number variations

PeerJ ◽

10.7717/peerj.12564 ◽

2021 ◽

Vol 9 ◽

pp. e12564

Author(s):

Taifu Wang ◽

Jinghua Sun ◽

Xiuqing Zhang ◽

Wen-Jing Wang ◽

Qing Zhou

Keyword(s):

Machine Learning ◽

False Positive ◽

Copy Number ◽

Genetic Disorders ◽

Genetic Diseases ◽

Basic Research ◽

Read Depth ◽

Copy Number Variations ◽

Sequencing Data ◽

Learning Framework

Background Copy-number variants (CNVs) have been recognized as one of the major causes of genetic disorders. Reliable detection of CNVs from genome sequencing data has been a strong demand for disease research. However, current software for detecting CNVs has high false-positive rates, which needs further improvement. Methods Here, we proposed a novel and post-processing approach for CNVs prediction (CNV-P), a machine-learning framework that could efficiently remove false-positive fragments from results of CNVs detecting tools. A series of CNVs signals such as read depth (RD), split reads (SR) and read pair (RP) around the putative CNV fragments were defined as features to train a classifier. Results The prediction results on several real biological datasets showed that our models could accurately classify the CNVs at over 90% precision rate and 85% recall rate, which greatly improves the performance of state-of-the-art algorithms. Furthermore, our results indicate that CNV-P is robust to different sizes of CNVs and the platforms of sequencing. Conclusions Our framework for classifying high-confident CNVs could improve both basic research and clinical diagnosis of genetic diseases.

Download Full-text

Variant calling and quality control of large-scale human genome sequencing data

Emerging Topics in Life Sciences ◽

10.1042/etls20190007 ◽

2019 ◽

Vol 3 (4) ◽

pp. 399-409 ◽

Cited By ~ 1

Author(s):

Brandon Jew ◽

Jae Hoon Sul

Keyword(s):

Quality Control ◽

Genome Sequencing ◽

Genetic Variants ◽

Large Scale ◽

Variant Calling ◽

Sequencing Data ◽

Computational Approaches ◽

Sequencing Errors ◽

Human Genome Sequencing ◽

Number Of Individuals

Abstract Next-generation sequencing has allowed genetic studies to collect genome sequencing data from a large number of individuals. However, raw sequencing data are not usually interpretable due to fragmentation of the genome and technical biases; therefore, analysis of these data requires many computational approaches. First, for each sequenced individual, sequencing data are aligned and further processed to account for technical biases. Then, variant calling is performed to obtain information on the positions of genetic variants and their corresponding genotypes. Quality control (QC) is applied to identify individuals and genetic variants with sequencing errors. These procedures are necessary to generate accurate variant calls from sequencing data, and many computational approaches have been developed for these tasks. This review will focus on current widely used approaches for variant calling and QC.

Download Full-text

Copy number variant calling on a 177 gene expanded carrier screening panel reveals impact of hbb deletions

Fertility and Sterility ◽

10.1016/j.fertnstert.2017.07.836 ◽

2017 ◽

Vol 108 (3) ◽

pp. e282

Author(s):

K.A. Beauchamp ◽

P. Grauman ◽

G.J. Hogan ◽

K.R. Haas ◽

G.M. Gould ◽

...

Keyword(s):

Copy Number ◽

Variant Calling ◽

Copy Number Variant ◽

Carrier Screening ◽

Expanded Carrier Screening

Download Full-text

Quantification of aneuploidy in targeted sequencing data using ASCETS

Bioinformatics ◽

10.1093/bioinformatics/btaa980 ◽

2020 ◽

Author(s):

Liam F Spurr ◽

Mehdi Touat ◽

Alison M Taylor ◽

Adrian M Dubuc ◽

Juliann Shih ◽

...

Keyword(s):

Copy Number ◽

Large Scale ◽

Genomic Analysis ◽

Targeted Sequencing ◽

Supplementary Information ◽

Supplementary Data ◽

Sequencing Data ◽

Copy Number Changes ◽

Panel Sequencing ◽

Chromosome Level

Abstract Summary The expansion of targeted panel sequencing efforts has created opportunities for large-scale genomic analysis, but tools for copy-number quantification on panel data are lacking. We introduce ASCETS, a method for the efficient quantitation of arm and chromosome-level copy-number changes from targeted sequencing data. Availability and implementation ASCETS is implemented in R and is freely available to non-commercial users on GitHub: https://github.com/beroukhim-lab/ascets, along with detailed documentation. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text