Evaluating assembly and variant calling software for strain-resolved analysis of large DNA viruses

Abstract Infection with human cytomegalovirus (HCMV) can cause severe complications in immunocompromised individuals and congenitally infected children. Characterizing heterogeneous viral populations and their evolution by high-throughput sequencing of clinical specimens requires the accurate assembly of individual strains or sequence variants and suitable variant calling methods. However, the performance of most methods has not been assessed for populations composed of low divergent viral strains with large genomes, such as HCMV. In an extensive benchmarking study, we evaluated 15 assemblers and 6 variant callers on 10 lab-generated benchmark data sets created with two different library preparation protocols, to identify best practices and challenges for analyzing such data. Most assemblers, especially metaSPAdes and IVA, performed well across a range of metrics in recovering abundant strains. However, only one, Savage, recovered low abundant strains and in a highly fragmented manner. Two variant callers, LoFreq and VarScan2, excelled across all strain abundances. Both shared a large fraction of false positive variant calls, which were strongly enriched in T to G changes in a ‘G.G’ context. The magnitude of this context-dependent systematic error is linked to the experimental protocol. We provide all benchmarking data, results and the entire benchmarking workflow named QuasiModo, Quasispecies Metric determination on omics, under the GNU General Public License v3.0 (https://github.com/hzi-bifo/Quasimodo), to enable full reproducibility and further benchmarking on these and other data.

Download Full-text

Evaluating assembly and variant calling software for strain-resolved analysis of large DNA-viruses

10.1101/2020.05.14.095265 ◽

2020 ◽

Author(s):

Z.-L. Deng ◽

A. Dhingra ◽

A. Fritz ◽

J. Götting ◽

P. C. Münch ◽

...

Keyword(s):

Best Practices ◽

High Throughput Sequencing ◽

Large Fraction ◽

Variant Calling ◽

Data Sets ◽

Library Preparation ◽

Experimental Protocol ◽

Dna Viruses ◽

Benchmark Data ◽

General Public License

AbstractInfection with human cytomegalovirus (HCMV) can cause severe complications in immunocompromised individuals and congenitally infected children. Characterizing heterogeneous viral populations and their evolution by high-throughput sequencing of clinical specimens requires the accurate assembly of individual strains or sequence variants and suitable variant calling methods. However, the performance of most methods has not been assessed for populations composed of low divergent viral strains with large genomes, such as HCMV. In an extensive benchmarking study, we evaluated 15 assemblers and six variant callers on ten lab-generated benchmark data sets created with two different library preparation protocols, to identify best practices and challenges for analyzing such data.Most assemblers, especially metaSPAdes and IVA, performed well across a range of metrics in recovering abundant strains. However, only one, Savage, recovered low abundant strains and in a highly fragmented manner. Two variant callers, LoFreq and VarScan2, excelled across all strain abundances. Both shared a large fraction of false positive (FP) variant calls, which were strongly enriched in T to G changes in a “G.G” context. The magnitude of this context-dependent systematic error is linked to the experimental protocol. We provide all benchmarking data, results and the entire benchmarking workflow named QuasiModo, Quasispecies Metric determination on omics, under the GNU General Public License v3.0 (https://github.com/hzi-bifo/Quasimodo), to enable full reproducibility and further benchmarking on these and other data.

Download Full-text

Thermal-Hydraulic Investigations of a Horizontal Dry Cask Simulator

Volume 3: Student Paper Competition; Thermal-Hydraulics; Verification and Validation ◽

10.1115/icone2020-16598 ◽

2020 ◽

Author(s):

Ramon J. M. Pulido ◽

Eric R. Lindgren ◽

Samuel G. Durbin ◽

Alex Salazar

Keyword(s):

Best Practices ◽

Thermal Loading ◽

Spent Nuclear Fuel ◽

Data Sets ◽

Temperature Profiles ◽

Benchmark Data ◽

Axial Temperature ◽

Wide Range ◽

Cooling Air Flow ◽

Cooling Air

Abstract Recent advances in horizontal cask designs for commercial spent nuclear fuel have significantly increased maximum thermal loading. This is due in part to greater efficiency in internal conduction pathways. Carefully measured data sets generated from testing of full-sized casks or smaller cask analogs are widely recognized as vital for validating thermal-hydraulic models of these storage cask designs. While several testing programs have been previously conducted, these earlier validation studies did not integrate all the physics or components important in a modern, horizontal dry cask system. The purpose of this investigation is to produce data sets that can be used to benchmark the codes and best practices presently used to calculate cladding temperatures and induced cooling air flows in modern, horizontal dry storage systems. The horizontal dry cask simulator (HDCS) has been designed to generate this benchmark data and complement the existing knowledge base. Transverse and axial temperature profiles along with induced-cooling air flow are measured using various backfills of gases for a wide range of decay powers and canister pressures. The data from the HDCS tests will be used to host a blind model validation effort.

Download Full-text

Comparison of sequencing data processing pipelines and application to underrepresented African human populations

BMC Bioinformatics ◽

10.1186/s12859-021-04407-x ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Gwenna Breton ◽

Anna C. V. Johansson ◽

Per Sjödin ◽

Carina M. Schlebusch ◽

Mattias Jakobsson

Keyword(s):

Best Practices ◽

High Throughput ◽

High Throughput Sequencing ◽

Variant Calling ◽

Human Populations ◽

Sequencing Data ◽

High Coverage ◽

Individual Level ◽

Bioinformatic Tools ◽

The Individual

Abstract Background Population genetic studies of humans make increasing use of high-throughput sequencing in order to capture diversity in an unbiased way. There is an abundance of sequencing technologies, bioinformatic tools and the available genomes are increasing in number. Studies have evaluated and compared some of these technologies and tools, such as the Genome Analysis Toolkit (GATK) and its “Best Practices” bioinformatic pipelines. However, studies often focus on a few genomes of Eurasian origin in order to detect technical issues. We instead surveyed the use of the GATK tools and established a pipeline for processing high coverage full genomes from a diverse set of populations, including Sub-Saharan African groups, in order to reveal challenges from human diversity and stratification. Results We surveyed 29 studies using high-throughput sequencing data, and compared their strategies for data pre-processing and variant calling. We found that processing of data is very variable across studies and that the GATK “Best Practices” are seldom followed strictly. We then compared three versions of a GATK pipeline, differing in the inclusion of an indel realignment step and with a modification of the base quality score recalibration step. We applied the pipelines on a diverse set of 28 individuals. We compared the pipelines in terms of count of called variants and overlap of the callsets. We found that the pipelines resulted in similar callsets, in particular after callset filtering. We also ran one of the pipelines on a larger dataset of 179 individuals. We noted that including more individuals at the joint genotyping step resulted in different counts of variants. At the individual level, we observed that the average genome coverage was correlated to the number of variants called. Conclusions We conclude that applying the GATK “Best Practices” pipeline, including their recommended reference datasets, to underrepresented populations does not lead to a decrease in the number of called variants compared to alternative pipelines. We recommend to aim for coverage of > 30X if identifying most variants is important, and to work with large sample sizes at the variant calling stage, also for underrepresented individuals and populations.

Download Full-text

Highly reproducible characterization of Escherichia coli tRNA epitranscriptome with a simple method of library preparation for deep sequencing

10.1101/2020.02.21.958157 ◽

2020 ◽

Author(s):

Ji Wang ◽

Claire Toffano-Nioche ◽

Florence Lorieux ◽

Daniel Gautheret ◽

Jean Lehmann

Keyword(s):

Escherichia Coli ◽

Deep Sequencing ◽

High Throughput Sequencing ◽

Large Fraction ◽

Model System ◽

Cdna Libraries ◽

Methyl Groups ◽

Library Preparation ◽

Simple Method

ABSTRACTIn conventional RNA high-throughput sequencing, modified bases prevent a large fraction of tRNA transcripts to be converted into cDNA libraries. Recent proposals aiming at resolving this issue take advantage of the interference of base modifications with RT enzymes to detect and identify them by establishing signals from aborted cDNA transcripts. Because some modifications, such as methyl groups, do almost not allow RT bypassing, demethylation and highly processive RT enzymes have been used to overcome these obstacles. Working with Escherichia coli as a model system, we show that with a conventional (albeit still engineered) RT enzyme and key optimizations in library preparation, all RT-impairing modifications can be highlighted along the entire tRNA length without a demethylation procedure. This is achieved by combining deep-sequencing samples, which allows to establish aborted transcription signal of higher accuracy and reproducibility, with the potential for differentiating tiny differences in the state of modification of all cellular tRNAs. In addition, our protocol provides estimates of the relative tRNA abundance.

Download Full-text

OVarFlow: a resource optimized GATK 4 based Open source Variant calling workFlow

BMC Bioinformatics ◽

10.1186/s12859-021-04317-y ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Jochen Bathke ◽

Gesine Lühken

Keyword(s):

Best Practices ◽

Open Source ◽

High Throughput Sequencing ◽

Variant Calling ◽

Model Organisms ◽

Phenotypic Trait ◽

Sequencing Data ◽

Wide Range ◽

Major Player ◽

Over Time

Abstract Background The advent of next generation sequencing has opened new avenues for basic and applied research. One application is the discovery of sequence variants causative of a phenotypic trait or a disease pathology. The computational task of detecting and annotating sequence differences of a target dataset between a reference genome is known as "variant calling". Typically, this task is computationally involved, often combining a complex chain of linked software tools. A major player in this field is the Genome Analysis Toolkit (GATK). The "GATK Best Practices" is a commonly referred recipe for variant calling. However, current computational recommendations on variant calling predominantly focus on human sequencing data and ignore ever-changing demands of high-throughput sequencing developments. Furthermore, frequent updates to such recommendations are counterintuitive to the goal of offering a standard workflow and hamper reproducibility over time. Results A workflow for automated detection of single nucleotide polymorphisms and insertion-deletions offers a wide range of applications in sequence annotation of model and non-model organisms. The introduced workflow builds on the GATK Best Practices, while enabling reproducibility over time and offering an open, generalized computational architecture. The workflow achieves parallelized data evaluation and maximizes performance of individual computational tasks. Optimized Java garbage collection and heap size settings for the GATK applications SortSam, MarkDuplicates, HaplotypeCaller, and GatherVcfs effectively cut the overall analysis time in half. Conclusions The demand for variant calling, efficient computational processing, and standardized workflows is growing. The Open source Variant calling workFlow (OVarFlow) offers automation and reproducibility for a computationally optimized variant calling task. By reducing usage of computational resources, the workflow removes prior existing entry barriers to the variant calling field and enables standardized variant calling.

Download Full-text

Faculty Opinions recommendation of Benchmark data sets for structure-based computational target prediction.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.718516631.793500133 ◽

2014 ◽

Author(s):

Vytas Bankaitis ◽

Ashutosh Tripathi

Keyword(s):

Target Prediction ◽

Data Sets ◽

Benchmark Data

Download Full-text

Non-Covalent Interactions Atlas Benchmark Data Sets 3: Repulsive Contacts

Journal of Chemical Theory and Computation ◽

10.1021/acs.jctc.0c01341 ◽

2021 ◽

Vol 17 (3) ◽

pp. 1548-1561

Author(s):

Kristian Kříž ◽

Martin Nováček ◽

Jan Řezáč

Keyword(s):

Data Sets ◽

Benchmark Data ◽

Non Covalent Interactions ◽

Covalent Interactions

Download Full-text

Bayesian Trigonometric Support Vector Classifier

Neural Computation ◽

10.1162/089976603322297368 ◽

2003 ◽

Vol 15 (9) ◽

pp. 2227-2254 ◽

Cited By ~ 20

Author(s):

Wei Chu ◽

S. Sathiya Keerthi ◽

Chong Jin Ong

Keyword(s):

Loss Function ◽

Gaussian Processes ◽

Likelihood Function ◽

Support Vector ◽

Data Sets ◽

Model Adaptation ◽

Bayesian Techniques ◽

Benchmark Data ◽

Support Vector Classifier ◽

Set Up

This letter describes Bayesian techniques for support vector classification. In particular, we propose a novel differentiable loss function, called the trigonometric loss function, which has the desirable characteristic of natural normalization in the likelihood function, and then follow standard gaussian processes techniques to set up a Bayesian framework. In this framework, Bayesian inference is used to implement model adaptation, while keeping the merits of support vector classifier, such as sparseness and convex programming. This differs from standard gaussian processes for classification. Moreover, we put forward class probability in making predictions. Experimental results on benchmark data sets indicate the usefulness of this approach.

Download Full-text

PS2-54: Best Practices: Improving Quality and Reliability in Research Data Sets

Clinical Medicine & Research ◽

10.3121/cmr.2013.1176.ps2-54 ◽

2013 ◽

Vol 11 (3) ◽

pp. 157-157

Author(s):

L. McFarland ◽

J. Richter ◽

C. Bredfeldt

Keyword(s):

Best Practices ◽

Research Data ◽

Data Sets ◽

Quality And Reliability

Download Full-text

Population-specific genome graphs improve high-throughput sequencing data analysis: A case study on the Pan-African genome

10.1101/2021.03.19.436173 ◽

2021 ◽

Author(s):

H. Serhat Tetikol ◽

Kubra Narci ◽

Deniz Turgut ◽

Gungor Budak ◽

Ozem Kalay ◽

...

Keyword(s):

High Throughput Sequencing ◽

Information Overload ◽

African Ancestry ◽

Sample Selection ◽

Variant Calling ◽

Population Diversity ◽

Human Populations ◽

Sequencing Data ◽

Bioinformatics Pipeline ◽

Graph Augmentation

ABSTRACTGraph-based genome reference representations have seen significant development, motivated by the inadequacy of the current human genome reference for capturing the diverse genetic information from different human populations and its inability to maintain the same level of accuracy for non-European ancestries. While there have been many efforts to develop computationally efficient graph-based bioinformatics toolkits, how to curate genomic variants and subsequently construct genome graphs remains an understudied problem that inevitably determines the effectiveness of the end-to-end bioinformatics pipeline. In this study, we discuss major obstacles encountered during graph construction and propose methods for sample selection based on population diversity, graph augmentation with structural variants and resolution of graph reference ambiguity caused by information overload. Moreover, we present the case for iteratively augmenting tailored genome graphs for targeted populations and test the proposed approach on the whole-genome samples of African ancestry. Our results show that, as more representative alternatives to linear or generic graph references, population-specific graphs can achieve significantly lower read mapping errors, increased variant calling sensitivity and provide the improvements of joint variant calling without the need of computationally intensive post-processing steps.

Download Full-text