BALSA: Integrated secondary analysis for whole-genome and whole-exome sequencing, accelerated by GPU

This paper reports an integrated solution, called BALSA, for the secondary analysis of next-generation sequencing data; it exploits the computational power of the GPU and intricate memory management to give a fast and accurate analysis. From raw reads to variants (including SNPs and Indels), BALSA, using just a single computing node with a commodity GPU board, takes 5.5 hours to process 50-fold whole-genome sequencing (~750 million 100 bp paired-end reads), or just 25 minutes for 210-fold whole-exome sequencing. BALSA’s speed is rooted in its parallel algorithms, which effectively exploit a GPU to speed up processes such as alignment, realignment and statistical testing. BALSA incorporates a 16-genotype model to support the calling of SNPs and Indels and achieves competitive variant-calling accuracy and sensitivity when compared to the ensemble of six popular variant callers. BALSA also supports efficient identification of somatic SNVs and CNVs; experiments showed that BALSA recovers all the previously validated somatic SNVs and CNVs, and that it is more sensitive for somatic Indel detection. BALSA outputs variants in VCF format. A pileup-like SNAPSHOT format, while maintaining the same fidelity as BAM in variant calling, enables efficient storage and indexing, and facilitates the development of apps for downstream analyses. BALSA is available at: http://sourceforge.net/p/balsa
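As a rough sanity check on the scale quoted above, the following sketch computes mean fold coverage from a read count and read length. It assumes that the ~750 million figure counts read pairs and that the human genome is about 3.1 Gbp; neither assumption is stated in the abstract, and the function name is purely illustrative.

```python
def fold_coverage(n_fragments, read_len_bp, genome_size_bp, reads_per_fragment=2):
    """Mean depth = total sequenced bases / genome size."""
    total_bases = n_fragments * reads_per_fragment * read_len_bp
    return total_bases / genome_size_bp


# Assumption: approximate human genome size, not taken from the abstract.
HUMAN_GENOME_BP = 3.1e9

# Assumption: 750 million *pairs* of 100 bp reads.
depth = fold_coverage(750e6, 100, HUMAN_GENOME_BP)
print(f"approx. depth: {depth:.1f}x")
```

Under these assumptions the result is close to the 50-fold depth reported; if the figure instead counted individual reads, the depth would be roughly half that.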

Estimating sequencing error rates using families

BioData Mining ◽ 10.1186/s13040-021-00259-6 ◽ 2021 ◽ Vol 14 (1)

Author(s): Kelley Paskov ◽ Jae-Yoon Jung ◽ Brianna Chrisman ◽ Nate T. Stockham ◽ Peter Washington ◽ ...

Keyword(s): Whole Genome Sequencing ◽ Exome Sequencing ◽ Genome Sequencing ◽ Variant Calling ◽ Error Rates ◽ Sequencing Error ◽ Whole Genome ◽ Sequencing Data ◽ Sequencing Platform ◽ Whole Exome

Background: As next-generation sequencing technologies make their way into the clinic, knowledge of their error rates is essential if they are to be used to guide patient care. However, sequencing platforms and variant-calling pipelines are continuously evolving, making it difficult to accurately quantify error rates for the particular combination of assay and software parameters used on each sample. Family data provide a unique opportunity for estimating sequencing error rates, since they allow us to observe a fraction of sequencing errors as Mendelian errors in the family, which we can then use to produce genome-wide error estimates for each sample.

Results: We introduce a method that uses Mendelian errors in sequencing data to make highly granular per-sample estimates of precision and recall for any set of variant calls, regardless of sequencing platform or calling methodology. We validate the accuracy of our estimates using monozygotic twins, and we use a set of monozygotic quadruplets to show that our predictions closely match the consensus method. We demonstrate our method’s versatility by estimating sequencing error rates for whole-genome sequencing, whole-exome sequencing, and microarray datasets, and we highlight its sensitivity by quantifying performance increases between different versions of the GATK variant-calling pipeline. We then use our method to demonstrate that:
1) Sequencing error rates between samples in the same dataset can vary by over an order of magnitude.
2) Variant calling performance decreases substantially in low-complexity regions of the genome.
3) Variant calling performance in whole-exome sequencing data decreases with distance from the nearest target region.
4) Variant calls from lymphoblastoid cell lines can be as accurate as those from whole blood.
5) Whole-genome sequencing can attain microarray-level precision and recall at disease-associated SNV sites.

Conclusion: Genotype datasets from families are powerful resources that can be used to make fine-grained estimates of sequencing error for any sequencing platform and variant-calling methodology.
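To make the notion of a Mendelian error concrete, here is a minimal sketch that flags biallelic trio genotypes that cannot arise from the parents' genotypes. It is not the authors' estimation method, which goes on to infer genome-wide precision and recall; it only illustrates the observable signal. Genotypes are coded as alternate-allele counts (0, 1, 2), sites are assumed autosomal with no missing calls, and all function names are hypothetical.

```python
from itertools import product


def transmissible(gt):
    """Alleles (0 = ref, 1 = alt) that a parent with alt-allele count `gt` can pass on."""
    return {0: {0}, 1: {0, 1}, 2: {1}}[gt]


def is_mendelian_error(mother, father, child):
    """True if the child's alt-allele count cannot result from one allele per parent."""
    possible = {m + f for m, f in product(transmissible(mother), transmissible(father))}
    return child not in possible


def mendelian_error_rate(trio_genotypes):
    """Fraction of (mother, father, child) sites that violate Mendelian inheritance."""
    trio_genotypes = list(trio_genotypes)
    if not trio_genotypes:
        return 0.0
    errors = sum(is_mendelian_error(m, f, c) for m, f, c in trio_genotypes)
    return errors / len(trio_genotypes)


# Toy data: (mother, father, child) alt-allele counts at five sites.
sites = [(0, 0, 0), (0, 1, 1), (2, 2, 1), (0, 0, 1), (1, 1, 2)]
print(mendelian_error_rate(sites))
```

Note that when both parents are heterozygous, every child genotype is consistent, so a miscalled child genotype at such a site is invisible to this check. This is one reason only a fraction of sequencing errors surface as Mendelian errors.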

BALSA: integrated secondary analysis for whole-genome and whole-exome sequencing, accelerated by GPU

PeerJ ◽ 10.7717/peerj.421 ◽ 2014 ◽ Vol 2 ◽ pp. e421 ◽ Cited By ~ 10

Author(s): Ruibang Luo ◽ Yiu-Lun Wong ◽ Wai-Chun Law ◽ Lap-Kei Lee ◽ Jeanno Cheung ◽ ...

Keyword(s): Exome Sequencing ◽ Whole Exome Sequencing ◽ Secondary Analysis ◽ Whole Genome ◽ Whole Exome

DeNovoCNN: A deep learning approach to de novo variant calling in next generation sequencing data

10.1101/2021.09.20.461072 ◽ 2021

Author(s): Gelana Khazeeva ◽ Karolis Sablauskas ◽ Bart van der Sanden ◽ Wouter Steyaert ◽ Michael Kwint ◽ ...

Keyword(s): Exome Sequencing ◽ De Novo ◽ Genetic Disorders ◽ Variant Calling ◽ Next Generation Sequencing Data ◽ Sequencing Data ◽ Accurate Identification ◽ Whole Exome ◽ De Novo Variant ◽ Generation Sequencing

De novo mutations (DNMs) are an important cause of genetic disorders. The accurate identification of DNMs from sequencing data is therefore fundamental to rare disease research and diagnostics. Unfortunately, identifying reliable DNMs remains a major challenge due to sequencing errors, uneven coverage, and mapping artifacts. Here, we developed a deep convolutional neural network (CNN) DNM caller (DeNovoCNN) that encodes the alignment of sequence reads for a trio as 160×164 resolution images. DeNovoCNN was trained on DNMs from whole-exome sequencing (WES) of 2003 trios, achieving on average 99.2% recall and 93.8% precision. We find that DeNovoCNN has increased recall/sensitivity and precision compared to existing de novo calling approaches (GATK, DeNovoGear, Samtools) based on the Genome in a Bottle reference dataset. Sanger validations of DNMs called in both exome and genome datasets confirm that DeNovoCNN outperforms existing methods. Most importantly, we show that DeNovoCNN is robust against different exome sequencing and analysis approaches, thereby allowing it to be applied to other datasets. DeNovoCNN is freely available and can be run on existing alignment (BAM/CRAM) and variant calling (VCF) files from WES and WGS without the need to re-call variants.
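For orientation, the sketch below shows the naive genotype-based definition of a de novo candidate (the child carries an alternate allele that neither parent carries) and how precision and recall are computed against a truth set. This is not DeNovoCNN, which classifies image encodings of trio alignments with a CNN; the function names and toy data are hypothetical, and genotypes are coded as alternate-allele counts.

```python
def is_denovo_candidate(mother, father, child):
    """Child carries an alt allele while both parents are homozygous reference."""
    return mother == 0 and father == 0 and child > 0


def precision_recall(called, truth):
    """Precision and recall of a call set against a truth set of sites."""
    called, truth = set(called), set(truth)
    true_positives = len(called & truth)
    precision = true_positives / len(called) if called else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Toy trio genotypes keyed by (chromosome, position): (mother, father, child).
trio_calls = {
    ("chr1", 1000): (0, 0, 1),
    ("chr1", 2000): (0, 1, 1),  # inherited from the father, not de novo
    ("chr2", 500): (0, 0, 1),
    ("chr3", 42): (0, 0, 2),
}
called = {site for site, (m, f, c) in trio_calls.items() if is_denovo_candidate(m, f, c)}

# Hypothetical validated DNMs for this toy example.
truth = {("chr1", 1000), ("chr2", 500), ("chr4", 7)}

precision, recall = precision_recall(called, truth)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

A filter this simple treats every genotyping error in a parent or child as a potential DNM, which is the kind of false positive that more elaborate callers aim to suppress.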

Peer Review #2 of "BALSA: integrated secondary analysis for whole-genome and whole-exome sequencing, accelerated by GPU (v0.1)"

10.7287/peerj.421v0.1/reviews/2 ◽ 2014

Author(s): L Coin

Keyword(s): Peer Review ◽ Exome Sequencing ◽ Whole Exome Sequencing ◽ Secondary Analysis ◽ Whole Genome ◽ Whole Exome

Peer Review #1 of "BALSA: integrated secondary analysis for whole-genome and whole-exome sequencing, accelerated by GPU (v0.1)"

10.7287/peerj.421v0.1/reviews/1 ◽ 2014

Author(s): J Zook

Keyword(s): Peer Review ◽ Exome Sequencing ◽ Whole Exome Sequencing ◽ Secondary Analysis ◽ Whole Genome ◽ Whole Exome

Intersect-then-combine approach: improving the performance of somatic variant calling in whole exome sequencing data using multiple aligners and callers

Genome Medicine ◽ 10.1186/s13073-017-0425-1 ◽ 2017 ◽ Vol 9 (1) ◽ Cited By ~ 28

Author(s): Maurizio Callari ◽ Stephen-John Sammut ◽ Leticia De Mattos-Arruda ◽ Alejandra Bruna ◽ Oscar M. Rueda ◽ ...

Keyword(s): Exome Sequencing ◽ Whole Exome Sequencing ◽ Variant Calling ◽ Combine Approach ◽ Sequencing Data ◽ Somatic Variant ◽ Exome Sequencing Data ◽ Whole Exome ◽ Whole Exome Sequencing Data

Improved Variant Calling Accuracy by Merging Replicates in Whole-Exome Sequencing Studies

BioMed Research International ◽ 10.1155/2014/319534 ◽ 2014 ◽ Vol 2014 ◽ pp. 1-7 ◽ Cited By ~ 4

Author(s): Yanfeng Zhang ◽ Bingshan Li ◽ Chun Li ◽ Qiuyin Cai ◽ Wei Zheng ◽ ...

Keyword(s): Exome Sequencing ◽ Whole Exome Sequencing ◽ Large Scale ◽ Comprehensive Evaluation ◽ Large Population ◽ Variant Calling ◽ Population Based ◽ Sequencing Data ◽ Whole Exome ◽ Lower Depth

In large-scale population-based whole-exome sequencing (WES) studies, some samples are occasionally sequenced two or more times for a variety of reasons. To investigate how to use these duplicated sequencing data efficiently, we conducted a comprehensive evaluation of variant calling strategies. Ninety-two samples subjected to WES twice were selected from a large population study. These 92 duplicated samples were divided into two groups: group H, consisting of the higher-depth replicate for each subject, and group L, consisting of the lower-depth replicate for each subject. The merged samples for each subject were put in a third group, M. Using the GATK multisample toolkit, we compared variant calling accuracy among the three strategies. Hierarchical clustering analysis indicated that the two replicates for each subject showed high homogeneity. Comparative analyses based on the heterozygous-homozygous ratio (Hete/Homo), the transition-transversion ratio (Ti/Tv), and the overlapping rate with the 1000 Genomes Project consistently showed that the SNPs detected from the M group were more accurate than those detected from the H and L groups. These results suggested that merging homogeneous duplicated exomes, instead of using only one of them, could improve variant calling accuracy.
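The two ratio-based quality metrics named above can be computed as in the following sketch. It assumes SNVs are given as (ref, alt) base pairs and genotypes as alternate-allele counts; the function names and toy inputs are illustrative and are not drawn from the study.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """A transition swaps purine<->purine or pyrimidine<->pyrimidine."""
    ref, alt = ref.upper(), alt.upper()
    pair = {ref, alt}
    return ref != alt and (pair <= PURINES or pair <= PYRIMIDINES)


def titv_ratio(snvs):
    """Transition/transversion ratio over a list of (ref, alt) SNVs."""
    transitions = sum(is_transition(ref, alt) for ref, alt in snvs)
    transversions = len(snvs) - transitions
    return transitions / transversions if transversions else float("inf")


def het_hom_ratio(alt_counts):
    """Heterozygous / homozygous-alternate ratio from alt-allele counts (0, 1, 2)."""
    het = sum(1 for g in alt_counts if g == 1)
    hom_alt = sum(1 for g in alt_counts if g == 2)
    return het / hom_alt if hom_alt else float("inf")


# Toy inputs: three transitions and two transversions; three het and one hom-alt call.
snvs = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("T", "G")]
genotypes = [1, 1, 2, 0, 1]

print(titv_ratio(snvs))          # 3 transitions / 2 transversions
print(het_hom_ratio(genotypes))  # 3 het / 1 hom-alt
```

Comparing these ratios between call sets (for example, between the H, L, and M groups) gives a reference-free indication of relative call quality, which is how the abstract uses them.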
