GBStools: A Unified Approach for Reduced Representation Sequencing and Genotyping

2015
Author(s):
Thomas F Cooke
Muh-Ching Yee
Marina Muzzio
Alexandra Sockell
Ryan Bell
...  

Reduced representation sequencing methods such as genotyping-by-sequencing (GBS) enable low-cost measurement of genetic variation without the need for a reference genome assembly. These methods are widely used in genetic mapping and population genetics studies, especially with non-model organisms. Variant calling error rates, however, are higher in GBS than in standard sequencing, in particular due to restriction site polymorphisms, and few computational tools exist that specifically model and correct these errors. We developed a statistical method to remove errors caused by restriction site polymorphisms, implemented in the software package GBStools. We evaluated it on several simulated data sets, varying in number of samples, mean coverage, and population mutation rate, and on two empirical human data sets (N = 8 and N = 63 samples). In our simulations, GBStools improved genotype accuracy more than commonly used filters such as Hardy-Weinberg equilibrium p-values. GBStools is most effective at removing genotype errors in data sets of more than 100 samples when coverage is 40X or higher, and the improvement is most pronounced in species with high genomic diversity. We also demonstrate the utility of GBS and GBStools for human population genetic inference in Argentine populations, revealing widely varying individual ancestry proportions and an excess of singletons consistent with recent population growth.
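The abstract benchmarks GBStools against Hardy-Weinberg equilibrium p-value filters. The sketch below shows what such a baseline filter amounts to; the function and threshold are illustrative and not taken from the GBStools code base, which goes further by explicitly modeling restriction site dropout.

```python
# A minimal HWE chi-square filter, assuming biallelic sites and hard
# genotype calls; names and the p-value cutoff are illustrative only.
from scipy.stats import chi2

def hwe_chi2_pvalue(n_AA, n_Aa, n_aa):
    """Chi-square goodness-of-fit test for HWE at one biallelic site."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)           # frequency of allele A
    q = 1 - p
    expected = [p * p * n, 2 * p * q * n, q * q * n]
    observed = [n_AA, n_Aa, n_aa]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    return chi2.sf(stat, df=1)

# A restriction site polymorphism silences one allele in carriers, so
# heterozygotes are miscalled as homozygotes and the site drifts out of
# HWE; a p-value filter discards such sites wholesale, whereas GBStools
# models the dropout instead.
if hwe_chi2_pvalue(n_AA=50, n_Aa=10, n_aa=40) < 1e-4:
    print("site fails HWE filter (possible restriction site polymorphism)")
```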


2019
Author(s):
Luisa Bresadola
Vivian Link
C. Alex Buerkle
Christian Lexer
Daniel Wegmann

Abstract In non-model organisms, evolutionary questions are frequently addressed using reduced representation sequencing techniques due to their low cost, ease of use, and because they do not require genomic resources such as a reference genome. However, evidence is accumulating that such techniques may be affected by specific biases, calling into question the accuracy of the obtained genotypes and, as a consequence, their usefulness in evolutionary studies. Here we introduce three strategies to estimate genotyping error rates from such data: through comparison to high-quality genotypes obtained with a different technique, from individual replicates, or from a population sample when assuming Hardy-Weinberg equilibrium. Applying these strategies to data obtained with restriction site-associated DNA sequencing (RAD-seq), arguably the most popular reduced representation sequencing technique, revealed per-allele genotyping error rates that were much higher than sequencing error rates, particularly at heterozygous sites wrongly inferred as homozygous. As we exemplify through the inference of genome-wide and local ancestry in well-characterized hybrids of two Eurasian poplar (Populus) species, such high error rates may lead to wrong biological conclusions. By properly accounting for these error rates in downstream analyses, either by incorporating genotyping errors directly or by recalibrating genotype likelihoods, we were nevertheless able to use the RAD-seq data to support biologically meaningful and robust inferences of ancestry among Populus hybrids. Based on these findings, we strongly recommend carefully assessing genotyping error rates in reduced representation sequencing experiments and properly accounting for them in downstream analyses, for instance using the tools presented here.
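Of the three strategies the abstract introduces, the replicate-based one is the simplest to illustrate. Below is a hedged, naive sketch under the assumption that errors occur independently in each replicate; the variable names are mine, and the authors' tools implement a proper likelihood model rather than this back-of-the-envelope estimator.

```python
# Naive per-allele genotyping error estimate from individual replicates.
import numpy as np

def per_allele_error_rate(rep1, rep2):
    """rep1, rep2: genotype calls coded 0/1/2 (alt-allele count), np.nan
    for missing. Returns a rough per-allele error rate, assuming errors
    occur independently in each replicate."""
    rep1, rep2 = np.asarray(rep1, float), np.asarray(rep2, float)
    called = ~np.isnan(rep1) & ~np.isnan(rep2)
    # Each genotype carries two alleles; count allele-level mismatches.
    allele_mismatches = np.abs(rep1[called] - rep2[called]).sum()
    mismatch_rate = allele_mismatches / (2 * called.sum())
    # A discordant call reflects an error in either replicate, so halve it.
    return mismatch_rate / 2

rep1 = [0, 1, 2, 1, np.nan, 0, 1]
rep2 = [0, 0, 2, 1, 1, 0, 2]
print(f"estimated per-allele error rate: {per_allele_error_rate(rep1, rep2):.3f}")
```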



2019
Author(s):
Melanie E. F. LaCava
Ellen O. Aikens
Libby C. Megna
Gregg Randolph
Charley Hubbard
...  

Abstract Advances in DNA sequencing have made it feasible to gather genomic data for non-model organisms and large sets of individuals, often using methods for sequencing subsets of the genome. Several of these methods sequence DNA associated with endonuclease restriction sites (the various RAD and GBS methods). For use in taxa without a reference genome, these methods rely on de novo assembly of the fragments in the sequencing library. Many of the software options available for this task were originally developed for other assembly types, and their accuracy for reduced representation libraries is unknown. To address this important knowledge gap, we simulated data from the Arabidopsis thaliana and Homo sapiens genomes and compared de novo assemblies by six software programs that are commonly used or promising for this purpose (ABySS, CD-HIT, Stacks, Stacks2, Velvet, and VSEARCH). We simulated different mutation rates and types of mutations, and then applied the six assemblers to the simulated data sets, varying assembly parameters. We found substantial variation in software performance across simulations and parameter settings. ABySS failed to recover any true genome fragments, and Velvet and VSEARCH performed poorly in most simulations. Stacks and Stacks2 produced accurate assemblies of simulations containing SNPs, but the addition of insertion and deletion mutations decreased their performance. CD-HIT was the only assembler that consistently recovered a high proportion of true genome fragments. Here, we demonstrate the substantial differences in the accuracy of assemblies from different software programs and the importance of comparing assemblies that result from different parameter settings.
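The simulation step the abstract describes, digesting a genome in silico and mutating the resulting fragments before assembly testing, can be sketched in a few lines. The enzyme motif and mutation rate below are example values, not the ones used in the study.

```python
# Illustrative in-silico restriction digest plus SNP injection.
import random

def digest(sequence, motif="GAATTC"):          # EcoRI-like recognition site
    """Split a sequence at every occurrence of the restriction motif."""
    fragments, start = [], 0
    pos = sequence.find(motif)
    while pos != -1:
        fragments.append(sequence[start:pos])
        start = pos
        pos = sequence.find(motif, pos + 1)
    fragments.append(sequence[start:])
    return [f for f in fragments if f]

def add_snps(fragment, rate=0.01, alphabet="ACGT"):
    """Mutate each base independently with the given per-base SNP rate."""
    bases = list(fragment)
    for i, b in enumerate(bases):
        if random.random() < rate:
            bases[i] = random.choice([x for x in alphabet if x != b])
    return "".join(bases)

random.seed(42)
genome = "".join(random.choice("ACGT") for _ in range(50_000))
fragments = digest(genome)
mutated = [add_snps(f) for f in fragments]
print(f"{len(fragments)} fragments from the in-silico digest")
```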



Author(s):  
Nico Borgsmüller
Jose Bonet
Francesco Marass
Abel Gonzalez-Perez
Nuria Lopez-Bigas
...  

Abstract The high resolution of single-cell DNA sequencing (scDNA-seq) offers great potential to resolve intra-tumor heterogeneity by distinguishing clonal populations based on their mutation profiles. However, the increasing size of scDNA-seq data sets and technical limitations, such as high error rates and a large proportion of missing values, complicate this task and limit the applicability of existing methods. Here we introduce BnpC, a novel non-parametric method to cluster individual cells into clones and infer their genotypes from their noisy mutation profiles. BnpC employs a Dirichlet process mixture model coupled with a Markov chain Monte Carlo sampling scheme, including a modified split-merge move and a novel posterior estimator to predict clones and genotypes. We benchmarked our method comprehensively against state-of-the-art methods on simulated data of various sizes and applied it to three cancer scDNA-seq data sets. On simulated data, BnpC compared favorably against current methods in terms of accuracy, runtime, and scalability. Its inferred genotypes were the most accurate, and it was the only method able to run and produce results on data sets with 10,000 cells. On tumor scDNA-seq data, BnpC identified clonal populations missed by the original cluster analysis but supported by supplementary experimental data. With ever-growing scDNA-seq data sets, scalable and accurate methods such as BnpC will become increasingly relevant, not only to resolve intra-tumor heterogeneity but also as a pre-processing step to reduce data size. BnpC is freely available under the MIT license at https://github.com/cbg-ethz/BnpC.
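To make the model concrete, here is a toy version of the core idea: a Chinese restaurant process (Dirichlet process) prior over clone assignments combined with a false-positive/false-negative noise model for the binary mutation matrix. For brevity it samples assignments over a fixed set of candidate clone genotypes with a bare Gibbs sweep; BnpC itself also samples the genotypes and adds the split-merge move and posterior estimator mentioned above. All parameter values are illustrative.

```python
# Toy CRP clustering of noisy single-cell mutation profiles.
import numpy as np

rng = np.random.default_rng(0)
ALPHA, FPR, FNR = 1.0, 0.01, 0.2   # DP concentration and error rates (examples)

def cell_loglik(cell, clone_profile):
    """Log-likelihood of one cell's 0/1 mutation calls given a clone genotype."""
    p_one = np.where(clone_profile == 1, 1 - FNR, FPR)   # P(observe a 1)
    return np.sum(np.where(cell == 1, np.log(p_one), np.log(1 - p_one)))

def gibbs_sweep(data, z, profiles):
    """One Gibbs sweep reassigning each cell under the CRP prior."""
    for i in range(len(data)):
        counts = np.bincount(np.delete(z, i), minlength=len(profiles))
        log_probs = np.array([
            np.log(counts[k] if counts[k] > 0 else ALPHA)  # CRP: size, or alpha
            + cell_loglik(data[i], profile)
            for k, profile in enumerate(profiles)])
        probs = np.exp(log_probs - log_probs.max())
        z[i] = rng.choice(len(profiles), p=probs / probs.sum())
    return z

# Two true clones over five mutation sites, six cells with noisy profiles.
profiles = [np.array([1, 1, 0, 0, 0]), np.array([0, 0, 1, 1, 1])]
data = np.array([[1, 1, 0, 0, 0], [1, 0, 0, 0, 0], [1, 1, 0, 1, 0],
                 [0, 0, 1, 1, 1], [0, 0, 1, 0, 1], [0, 1, 1, 1, 1]])
z = rng.integers(0, 2, size=len(data))
for _ in range(20):
    z = gibbs_sweep(data, z, profiles)
print("inferred clone assignments:", z)
```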



2017
Vol 43 (1)
pp. 115-131
Author(s):
Marc J. Lanovaz
Patrick Cardinal
Mary Francis

Although visual inspection remains common in the analysis of single-case designs, the lack of agreement between raters is an issue that may seriously compromise its validity. The purpose of our study was therefore to develop and examine the properties of a simple structured criterion to supplement the visual analysis of alternating-treatment designs. To this end, we generated simulated data sets with varying numbers of points, numbers of conditions, effect sizes, and autocorrelations, and then measured the Type I error rates and power produced by the visual structured criterion (VSC) and by permutation analyses. We also validated the results for Type I error rates using nonsimulated data. Overall, our results indicate that using the VSC as a supplement for the analysis of systematically alternating-treatment designs with at least five points per condition generally provides adequate control over Type I error rates and sufficient power to detect most behavior changes.
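For readers unfamiliar with the permutation analyses the VSC is compared against, the sketch below shuffles the condition labels of a systematically alternating two-condition design and locates the observed mean difference in the permutation distribution. The data values are made up for illustration.

```python
# One-sided permutation test for an alternating-treatment design.
import numpy as np

def permutation_pvalue(values, labels, n_perm=10_000, seed=1):
    rng = np.random.default_rng(seed)
    values, labels = np.asarray(values, float), np.asarray(labels)
    observed = values[labels == "B"].mean() - values[labels == "A"].mean()
    count = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(labels)
        diff = values[shuffled == "B"].mean() - values[shuffled == "A"].mean()
        count += diff >= observed
    return count / n_perm

# Five points per condition, the minimum the abstract recommends.
values = [3, 7, 4, 8, 2, 9, 5, 8, 4, 9]
labels = ["A", "B"] * 5
print(f"p = {permutation_pvalue(values, labels):.4f}")
```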



2018
Author(s):
Jasmijn A. Baaijens
Bastiaan Van der Roest
Johannes Köster
Leen Stougie
Alexander Schönhuth

Abstract Motivation Viruses populate their hosts as a viral quasispecies: a collection of genetically related mutant strains. Viral quasispecies assembly refers to reconstructing the strain-specific haplotypes from read data and predicting their relative abundances within the mix of strains, an important step for various treatment-related reasons. Reference-genome-independent ("de novo") approaches have yielded benefits over reference-guided approaches, because reference-induced biases can become overwhelming when dealing with divergent strains. While very accurate, extant de novo methods only yield rather short contigs; it remains to reconstruct full-length haplotypes, together with their abundances, from such contigs. Method We first construct a variation graph, a recently popular structure well suited to arranging and integrating several related genomes, from the short input contigs, without making use of a reference genome. To obtain paths through the variation graph that reflect the original haplotypes, we solve a minimization problem that yields a selection of maximal-length paths optimal in terms of compatibility with the read coverages computed for the nodes of the variation graph. We output the resulting selection of maximal-length paths as the haplotypes, together with their abundances. Results Benchmarking experiments on challenging simulated data sets show significant improvements in assembly contiguity compared to the input contigs, while preserving low error rates. As a consequence, our method outperforms all state-of-the-art viral quasispecies assemblers that aim at the construction of full-length haplotypes, in terms of various relevant assembly measures. Our tool, Virus-VG, is publicly available at https://bitbucket.org/jbaaijens/virus-vg.
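The minimization under "Method" can be pictured with a much simplified stand-in: given candidate haplotype paths and per-node read coverages, choose path abundances so that the abundances of the paths through each node add up to that node's coverage. Virus-VG solves a related optimization over maximal-length paths; the non-negative least-squares toy below, with a hand-built two-path graph, only illustrates the idea.

```python
# Estimating path abundances from node coverages via non-negative least squares.
import numpy as np
from scipy.optimize import nnls

node_coverage = np.array([100.0, 60.0, 40.0, 100.0, 60.0, 40.0])

# paths[k][j] = 1 if candidate path k traverses node j (toy example).
paths = np.array([
    [1, 1, 0, 1, 1, 0],   # haplotype through the two "60x" branch nodes
    [1, 0, 1, 1, 0, 1],   # haplotype through the two "40x" branch nodes
])

abundances, residual = nnls(paths.T.astype(float), node_coverage)
for k, a in enumerate(abundances):
    print(f"path {k}: estimated abundance {a:.1f}x (residual {residual:.2f})")
```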



BMC Genomics
2021
Vol 22 (1)
Author(s):
Henrik Christiansen
Franz M. Heindler
Bart Hellemans
Quentin Jossart
Francesca Pasotti
...  

Abstract Background Genome-wide data are invaluable for characterizing differentiation and adaptation in natural populations. Reduced representation sequencing (RRS) subsamples a genome repeatedly across many individuals. However, RRS requires careful optimization and fine-tuning to deliver high marker density while remaining cost-efficient. The number of genomic fragments created through restriction enzyme digestion and the sequencing library setup must match to achieve sufficient sequencing coverage per locus. Here, we present a workflow based on published information and on computational and experimental procedures to investigate and streamline the applicability of RRS. Results In an iterative process, genome size estimates, restriction enzymes, and size selection windows were tested and scaled in six classes of Antarctic animals (Ostracoda, Malacostraca, Bivalvia, Asteroidea, Actinopterygii, Aves). Achieving high marker density would be expensive in amphipods, the malacostracan target taxon, due to their large genome size; we propose alternative approaches such as mitogenome or target capture sequencing for this group. Pilot libraries were sequenced for all other target taxa. Ostracods, bivalves, sea stars, and fish showed overall good coverage and marker numbers for downstream population genomic analyses. In contrast, the bird test library produced low coverage and few polymorphic loci, likely due to degraded DNA. Conclusions Prior testing and optimization are important to identify which groups are amenable to RRS and where alternative methods may currently offer better cost-benefit ratios. The steps outlined here are easy to follow for other non-model taxa with few genomic resources, thus stimulating efficient resource use for the many pressing research questions in molecular ecology.
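The matching of fragment numbers to the sequencing budget can be sanity-checked with simple arithmetic, shown below. The genome size, enzyme cutter length, size-selection fraction, and read count are illustrative, not the paper's estimates.

```python
# Back-of-the-envelope check: do loci and sequencing budget match?

def expected_fragments(genome_size, motif_length):
    """Expected cut sites in a random genome: one per 4^motif_length bp."""
    return genome_size / 4 ** motif_length

def coverage_per_locus(reads, n_individuals, n_loci):
    """Mean read depth per locus per individual for a pooled library."""
    return reads / (n_individuals * n_loci)

# A 1 Gb genome digested with a 6-cutter, keeping ~10% of fragments
# after size selection, sequenced with 400 M reads across 96 individuals:
n_loci = 0.10 * expected_fragments(genome_size=1e9, motif_length=6)
depth = coverage_per_locus(reads=400e6, n_individuals=96, n_loci=n_loci)
print(f"~{n_loci:,.0f} loci, ~{depth:.0f}x per locus per individual")
```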



2019
Author(s):
Adam H. Freedman
Michele Clamp
Timothy B. Sackton

Abstract De novo transcriptome assembly is a powerful tool, widely used over the last decade for making evolutionary inferences. However, it relies on two implicit assumptions: that the assembled transcriptome is an unbiased representation of the underlying expressed transcriptome, and that expression estimates from the assembly are good, if noisy, approximations of the relative abundance of expressed transcripts. Using publicly available data for model organisms, we demonstrate that, across assembly algorithms and data sets, these assumptions are consistently violated. Bias exists at the nucleotide level, with genotyping error rates ranging from 30% to 83%. As a result, diversity is underestimated in transcriptome assemblies, with consistent underestimation of heterozygosity in all but the most inbred samples. Even at the gene level, expression estimates show wide deviations from map-to-reference estimates, and positive bias at lower expression levels. Standard filtering of transcriptome assemblies improves the robustness of gene expression estimates but leads to the loss of a meaningful number of protein-coding genes, including many that are highly expressed. We demonstrate a computational method, length-rescaled CPM, that partly alleviates noise and bias in expression estimates. Researchers should consider ways to minimize the impact of bias in transcriptome assemblies.
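The abstract names length-rescaled CPM without giving its formula, so the sketch below is an assumption-laden reading rather than the authors' definition: rescale each contig's counts-per-million by its assembled length (in kb) before comparing expression values.

```python
# Assumed form of length-rescaled CPM; not the authors' exact formula.
import numpy as np

def cpm(counts):
    counts = np.asarray(counts, float)
    return counts / counts.sum() * 1e6

def length_rescaled_cpm(counts, lengths):
    """CPM divided by contig length in kb (assumed rescaling)."""
    return cpm(counts) / (np.asarray(lengths, float) / 1e3)

counts = [1500, 300, 4200]       # mapped reads per contig (toy values)
lengths = [2000, 500, 3500]      # assembled contig lengths in bp
print(np.round(length_rescaled_cpm(counts, lengths), 1))
```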



2019
Vol 45 (9)
pp. 1183-1198
Author(s):
Gaurav S. Chauhan
Pradip Banerjee

Purpose Recent papers on target capital structure show that debt ratios seem to vary widely in space and time, implying that functional specifications of target debt ratios are of little empirical use. Further, target behavior cannot be adjudged correctly using debt ratios, as these could revert for purely mechanical reasons. The purpose of this paper is to develop an alternative strategy for testing target capital structure. Design/methodology/approach The authors treat a major "shock" to the debt ratio as an event and interpret a subsequent reversion as movement toward a mean or target debt ratio. By doing so, the authors no longer need to specify target debt ratios as a function of firm-specific variables or any other rigid functional form. Findings As with the broad empirical evidence from developed economies, there is no perceptible and systematic mean reversion by Indian firms. However, unlike in developed countries, the proportionate use of debt to finance firms' marginal financing deficits is extensive; equity is used rather sparingly. Research limitations/implications The trade-off theory could be convincingly refuted, at least for the emerging market of India. The paper should stimulate further research into the reasons for the specific financing behavior of emerging-market firms. Practical implications The results show that firms' financing choices depend not only on firm-specific variables but also on the financial markets in which they operate. Originality/value This study assesses mean reversion in debt ratios in a unique but reassuring manner. The results are confirmed by extensive calibration of the testing strategy using simulated data sets.
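The event-based testing strategy under "Design/methodology/approach" can be stylized as follows: flag large one-period jumps in a firm's debt ratio and measure what fraction of each jump is undone over a fixed horizon. The thresholds and the simulated random-walk panel are illustrative, not the paper's specification; a random walk should show no systematic reversion, which is the null the strategy tests against.

```python
# Stylized shock-and-reversion measurement on simulated debt ratios.
import numpy as np

rng = np.random.default_rng(7)

def reversion_after_shock(series, shock=0.10, horizon=3):
    """For each one-period jump of at least `shock`, return the fraction
    of the jump undone within `horizon` periods (1 = full reversion)."""
    fractions = []
    for t in range(len(series) - 1 - horizon):
        jump = series[t + 1] - series[t]
        if abs(jump) >= shock:
            undone = series[t + 1] - series[t + 1 + horizon]
            fractions.append(undone / jump)
    return fractions

# 200 firms with random-walk debt ratios clipped to [0, 1].
panel = [np.clip(np.cumsum(rng.normal(0, 0.05, 20)) + 0.4, 0, 1)
         for _ in range(200)]
fractions = [f for firm in panel for f in reversion_after_shock(firm)]
print(f"{len(fractions)} shocks, mean fraction reverted: {np.mean(fractions):.2f}")
```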


