xAtlas: Scalable small variant calling across heterogeneous next-generation sequencing experiments

AbstractMotivationThe rapid development of next-generation sequencing (NGS) technologies has lowered the barriers to genomic data generation, resulting in millions of samples sequenced across diverse experimental designs. The growing volume and heterogeneity of these sequencing data complicate the further optimization of methods for identifying DNA variation, especially considering that curated highconfidence variant call sets commonly used to evaluate these methods are generally developed by reference to results from the analysis of comparatively small and homogeneous sample sets.ResultsWe have developed xAtlas, an application for the identification of single nucleotide variants (SNV) and small insertions and deletions (indels) in NGS data. xAtlas is easily scalable and enables execution and retraining with rapid development cycles. Generation of variant calls in VCF or gVCF format from BAM or CRAM alignments is accomplished in less than one CPU-hour per 30× short-read human whole-genome. The retraining capabilities of xAtlas allow its core variant evaluation models to be optimized on new sample data and user-defined truth sets. Obtaining SNV and indels calls from xAtlas can be achieved more than 40 times faster than established methods while retaining the same accuracy.AvailabilityFreely available under a BSD 3-clause license at https://github.com/jfarek/[email protected] informationSupplementary data are available at Bioinformatics online.

Download Full-text

Pisces: An Accurate and Versatile Variant Caller for Somatic and Germline Next-Generation Sequencing Data

10.1101/291641 ◽

2018 ◽

Cited By ~ 1

Author(s):

Tamsen Dunn ◽

Gwenn Berry ◽

Dorothea Emig-Agius ◽

Yu Jiang ◽

Serena Lei ◽

...

Keyword(s):

Next Generation Sequencing ◽

Gene Mutations ◽

Variant Calling ◽

Amplicon Sequencing ◽

Supplementary Information ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Ras Gene ◽

Generation Sequencing

AbstractMotivationNext-Generation Sequencing (NGS) technology is transitioning quickly from research labs to clinical settings. The diagnosis and treatment selection for many acquired and autosomal conditions necessitate a method for accurately detecting somatic and germline variants, suitable for the clinic.ResultsWe have developed Pisces, a rapid, versatile and accurate small variant calling suite designed for somatic and germline amplicon sequencing applications. Pisces accuracy is achieved by four distinct modules, the Pisces Read Stitcher, Pisces Variant Caller, the Pisces Variant Quality Recalibrator, and the Pisces Variant Phaser. Each module incorporates a number of novel algorithmic strategies aimed at reducing noise or increasing the likelihood of detecting a true variant.AvailabilityPisces is distributed under an open source license and can be downloaded from https://github.com/Illumina/Pisces. Pisces is available on the BaseSpace™ SequenceHub as part of the TruSeq Amplicon workflow and the Illumina Ampliseq Workflow. Pisces is distributed on Illumina sequencing platforms such as the MiSeq™, and is included in the Praxis™ Extended RAS Panel test which was recently approved by the FDA for the detection of multiple RAS gene [email protected] informationSupplementary data are available online.

Download Full-text

Spatially constrained tumour growth affects the patterns of clonal selection and neutral drift in cancer genomic data

10.1101/544536 ◽

2019 ◽

Cited By ~ 3

Author(s):

Kate Chkhaidze ◽

Timon Heide ◽

Benjamin Werner ◽

Marc J. Williams ◽

Weini Huang ◽

...

Keyword(s):

Next Generation Sequencing ◽

Tumour Growth ◽

Evolutionary Dynamics ◽

Clonal Selection ◽

Genomic Data ◽

Confounding Factors ◽

Data Generation ◽

Next Generation ◽

Sequencing Data ◽

Generation Sequencing

AbstractQuantification of the effect of spatial tumour sampling on the patterns of mutations detected in next-generation sequencing data is largely lacking. Here we use a spatial stochastic cellular automaton model of tumour growth that accounts for somatic mutations, selection, drift and spatial constrains, to simulate multi-region sequencing data derived from spatial sampling of a neoplasm. We show that the spatial structure of a solid cancer has a major impact on the detection of clonal selection and genetic drift from bulk sequencing data and single-cell sequencing data. Our results indicate that spatial constrains can introduce significant sampling biases when performing multi-region bulk sampling and that such bias becomes a major confounding factor for the measurement of the evolutionary dynamics of human tumours. We present a statistical inference framework that takes into account the spatial effects of a growing tumour and allows inferring the evolutionary dynamics from patient genomic data. Our analysis shows that measuring cancer evolution using next-generation sequencing while accounting for the numerous confounding factors requires a mechanistic model-based approach that captures the sources of noise in the data.SummarySequencing the DNA of cancer cells from human tumours has become one of the main tools to study cancer biology. However, sequencing data are complex and often difficult to interpret. In particular, the way in which the tissue is sampled and the data are collected, impact the interpretation of the results significantly. We argue that understanding cancer genomic data requires mathematical models and computer simulations that tell us what we expect the data to look like, with the aim of understanding the impact of confounding factors and biases in the data generation step. In this study, we develop a spatial simulation of tumour growth that also simulates the data generation process, and demonstrate that biases in the sampling step and current technological limitations severely impact the interpretation of the results. We then provide a statistical framework that can be used to overcome these biases and more robustly measure aspects of the biology of tumours from the data.

Download Full-text

VikNGS: A C ++ Variant Integration Kit for Next Generation Sequencing Association Analysis

Bioinformatics ◽

10.1093/bioinformatics/btz716 ◽

2019 ◽

Cited By ~ 1

Author(s):

Zeynep Baskurt ◽

Scott Mastromatteo ◽

Jiafen Gong ◽

Richard F Wintle ◽

Stephen W Scherer ◽

...

Keyword(s):

Next Generation Sequencing ◽

Genetic Association ◽

Association Analysis ◽

Supplementary Information ◽

Next Generation Sequencing Data ◽

Data Sets ◽

Next Generation ◽

Sequencing Data ◽

Combining Data ◽

Generation Sequencing

Abstract Integration of next generation sequencing data (NGS) across different research studies can improve the power of genetic association testing by increasing sample size and can obviate the need for sequencing controls. If differential genotype uncertainty across studies is not accounted for, combining data sets can produce spurious association results. We developed the Variant Integration Kit for NGS (VikNGS), a fast cross-platform software package, to enable aggregation of several data sets for rare and common variant genetic association analysis of quantitative and binary traits with covariate adjustment. VikNGS also includes a graphical user interface, power simulation functionality and data visualization tools. Availability The VikNGS package can be downloaded at http://www.tcag.ca/tools/index.html. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

Hi-C chromosome conformation capture sequencing of avian genomes using the BGISEQ-500 platform

GigaScience ◽

10.1093/gigascience/giaa087 ◽

2020 ◽

Vol 9 (8) ◽

Author(s):

Marcela Sandoval-Velasco ◽

Juan Antonio Rodríguez ◽

Cynthia Perez Estrada ◽

Guojie Zhang ◽

Erez Lieberman Aiden ◽

...

Keyword(s):

Next Generation Sequencing ◽

High Throughput Sequencing ◽

Data Generation ◽

Next Generation ◽

Sequencing Data ◽

Yield Data ◽

Chromosome Conformation ◽

Sequencing Platform ◽

Sequencing Platforms ◽

Generation Sequencing

Abstract Background Hi-C experiments couple DNA-DNA proximity with next-generation sequencing to yield an unbiased description of genome-wide interactions. Previous methods describing Hi-C experiments have focused on the industry-standard Illumina sequencing. With new next-generation sequencing platforms such as BGISEQ-500 becoming more widely available, protocol adaptations to fit platform-specific requirements are useful to give increased choice to researchers who routinely generate sequencing data. Results We describe an in situ Hi-C protocol adapted to be compatible with the BGISEQ-500 high-throughput sequencing platform. Using zebra finch (Taeniopygia guttata) as a biological sample, we demonstrate how Hi-C libraries can be constructed to generate informative data using the BGISEQ-500 platform, following circularization and DNA nanoball generation. Our protocol is a modification of an Illumina-compatible method, based around blunt-end ligations in library construction, using un-barcoded, distally overhanging double-stranded adapters, followed by amplification using indexed primers. The resulting libraries are ready for circularization and subsequent sequencing on the BGISEQ series of platforms and yield data similar to what can be expected using Illumina-compatible approaches. Conclusions Our straightforward modification to an Illumina-compatible in situHi-C protocol enables data generation on the BGISEQ series of platforms, thus expanding the options available for researchers who wish to utilize the powerful Hi-C techniques in their research.

Download Full-text

Lacer: accurate base quality score recalibration for improving variant calling from next-generation sequencing data in any organism

10.1101/130732 ◽

2017 ◽

Author(s):

Jade C.S. Chung ◽

Swaine L. Chen

Keyword(s):

Next Generation Sequencing ◽

Variant Calling ◽

Quality Score ◽

Identification Accuracy ◽

Next Generation Sequencing Data ◽

Sequencing Error ◽

Next Generation ◽

Sequencing Data ◽

Base Quality Score ◽

Generation Sequencing

AbstractNext-generation sequencing data is accompanied by quality scores that quantify sequencing error. Inaccuracies in these quality scores propagate through all subsequent analyses; thus base quality score recalibration is a standard step in many next-generation sequencing workflows, resulting in improved variant calls. Current base quality score recalibration algorithms rely on the assumption that sequencing errors are already known; for human resequencing data, relatively complete variant databases facilitate this. However, because existing databases are still incomplete, recalibration is still inaccurate; and most organisms do not have variant databases, exacerbating inaccuracy for non-human data. To overcome these logical and practical problems, we introduce Lacer, which recalibrates base quality scores without assuming knowledge of correct and incorrect bases and without requiring knowledge of common variants. Lacer is the first logically sound, fully general, and truly accurate base recalibrator. Lacer enhances variant identification accuracy for resequencing data of human as well as other organisms (which are not accessible to current recalibrators), simultaneously improving and extending the benefits of base quality score recalibration to nearly all ongoing sequencing projects. Lacer is available at: https://github.com/swainechen/lacer.

Download Full-text

Wx: a neural network-based feature selection algorithm for next-generation sequencing data

10.1101/221911 ◽

2017 ◽

Cited By ~ 1

Author(s):

Sungsoo Park ◽

Bonggun Shin ◽

Yoonjung Choi ◽

Kilsoo Kang ◽

Keunsoo Kang

Keyword(s):

Neural Network ◽

Gene Expression ◽

Next Generation Sequencing ◽

Supplementary Information ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Selection Algorithm ◽

Sequencing Data ◽

Optimal Set ◽

Generation Sequencing

AbstractMotivationNext-generation sequencing (NGS), which allows the simultaneous sequencing of billions of DNA fragments simultaneously, has revolutionized how we study genomics and molecular biology by generating genome-wide molecular maps of molecules of interest. For example, an NGS-based transcriptomic assay called RNA-seq can be used to estimate the abundance of approximately 190,000 transcripts together. As the cost of next-generation sequencing sharply declines, researchers in many fields have been conducting research using NGS. The amount of information produced by NGS has made it difficult for researchers to choose the optimal set of target genes (or genomic loci).ResultsWe have sought to resolve this issue by developing a neural network-based feature (gene) selection algorithm called Wx. The Wx algorithm ranks genes based on the discriminative index (DI) score that represents the classification power for distinguishing given groups. With a gene list ranked by DI score, researchers can institutively select the optimal set of genes from the highest-ranking ones. We applied the Wx algorithm to a TCGA pan-cancer gene-expression cohort to identify an optimal set of gene-expression biomarker (universal gene-expression biomarkers) candidates that can distinguish cancer samples from normal samples for 12 different types of cancer. The 14 gene-expression biomarker candidates identified by Wx were comparable to or outperformed previously reported universal gene expression biomarkers, highlighting the usefulness of the Wx algorithm for next-generation sequencing data. Thus, we anticipate that the Wx algorithm can complement current state-of-the-art analytical applications for the identification of biomarker candidates as an alternative method.Availabilityhttps://github.com/deargen/[email protected] informationSupplementary data are available at online.

Download Full-text

NGSphy: phylogenomic simulation of next-generation sequencing data

10.1101/197715 ◽

2017 ◽

Author(s):

Merly Escalona ◽

Sara Rocha ◽

David Posada

Keyword(s):

Next Generation Sequencing ◽

Variant Calling ◽

Gene Families ◽

Common Species ◽

Next Generation Sequencing Data ◽

Phylogenomic Analysis ◽

Next Generation ◽

Sequencing Data ◽

Sequencing Technologies ◽

Generation Sequencing

AbstractMotivationAdvances in sequencing technologies have made it feasible to obtain massive datasets for phylogenomic inference, often consisting of large numbers of loci from multiple species and individuals. The phylogenomic analysis of next-generation sequencing (NGS) data implies a complex computational pipeline where multiple technical and methodological decisions are necessary that can influence the final tree obtained, like those related to coverage, assembly, mapping, variant calling and/or phasing.ResultsTo assess the influence of these variables we introduce NGSphy, an open-source tool for the simulation of Illumina reads/read counts obtained from haploid/diploid individual genomes with thousands of independent gene families evolving under a common species tree. In order to resemble real NGS experiments, NGSphy includes multiple options to model sequencing coverage (depth) heterogeneity across species, individuals and loci, including off-target or uncaptured loci. For comprehensive simulations covering multiple evolutionary scenarios, parameter values for the different replicates can be sampled from user-defined statistical distributions.AvailabilitySource code, full documentation and tutorials including a quick start guide are available at http://github.com/merlyescalona/[email protected]. [email protected]

Download Full-text