qc3C: reference-free quality control for Hi-C sequencing data

Software Tool ◽

Model Organisms ◽

Sequencing Data ◽

Proximity Ligation ◽

Genome Wide ◽

Costly Failure ◽

Wet Lab ◽

Abstract: Hi-C is a sample preparation method that enables high-throughput sequencing to capture genome-wide spatial interactions between DNA molecules. The technique has been successfully applied to solve challenging problems such as 3D structural analysis of chromatin, scaffolding of large genome assemblies and more recently the accurate resolution of metagenome-assembled genomes (MAGs). Despite continued refinements, however, Hi-C library preparation remains a complex laboratory protocol and diligent quality management is recommended to avoid costly failure. Current wet-lab protocols for Hi-C library QC provide only a crude assay, while commonly used sequence-based QC methods demand a reference genome, the quality of which can skew results. We propose a new, reference-free approach for Hi-C library quality assessment that requires only a modest amount of sequencing data. The algorithm builds upon the observation that proximity ligation events are likely to create k-mers that would not naturally occur in the sample. Our software tool (qc3C) is to our knowledge the first to implement a reference-free Hi-C QC tool, and also provides reference-based QC, enabling Hi-C to be more easily applied to non-model organisms and environmental samples. We characterise the accuracy of the new algorithm on simulated and real datasets and compare it to reference-based methods.

qc3C: Reference-free quality control for Hi-C sequencing data

PLoS Computational Biology ◽

10.1371/journal.pcbi.1008839 ◽

2021 ◽

Vol 17 (10) ◽

pp. e1008839

Author(s):

Matthew Z. DeMaere ◽

Aaron E. Darling

Keyword(s):

Quality Indicators ◽

Software Tool ◽

Model Organisms ◽

Sequencing Data ◽

Proximity Ligation ◽

Genome Wide ◽

Sequencing Quality ◽

Wet Lab ◽

Hi-C is a sample preparation method that enables high-throughput sequencing to capture genome-wide spatial interactions between DNA molecules. The technique has been successfully applied to solve challenging problems such as 3D structural analysis of chromatin, scaffolding of large genome assemblies and more recently the accurate resolution of metagenome-assembled genomes (MAGs). Despite continued refinements, however, preparing a Hi-C library remains a complex laboratory protocol. To avoid costly failures and maximise the odds of successful outcomes, diligent quality management is recommended. Current wet-lab methods provide only a crude assay of Hi-C library quality, while key post-sequencing quality indicators used have—thus far—relied upon reference-based read-mapping. When a reference is accessible, this reliance introduces a concern for quality, where an incomplete or inexact reference skews the resulting quality indicators. We propose a new, reference-free approach that infers the total fraction of read-pairs that are a product of proximity ligation. This quantification of Hi-C library quality requires only a modest amount of sequencing data and is independent of other application-specific criteria. The algorithm builds upon the observation that proximity ligation events are likely to create k-mers that would not naturally occur in the sample. Our software tool (qc3C) is to our knowledge the first to implement a reference-free Hi-C QC tool, and also provides reference-based QC, enabling Hi-C to be more easily applied to non-model organisms and environmental samples. We characterise the accuracy of the new algorithm on simulated and real datasets and compare it to reference-based methods.
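The observation that proximity ligation creates k-mers unlikely to occur naturally can be illustrated with a minimal sketch. It is not qc3C's implementation: the abstract does not describe the statistical model, so the code below only builds the junction sequence expected from a restriction enzyme and reports the raw fraction of reads containing it. The function names, the example reads, and the assumption of a palindromic site with a 5' overhang that is filled in before blunt-end ligation are all illustrative.

```python
def ligation_junction(site: str, cut: int) -> str:
    """Junction expected when two filled-in 5'-overhang ends are blunt-ligated.

    site: recognition sequence on the top strand (assumed palindromic).
    cut:  number of bases before the top-strand cut (0 for ^GATC, 1 for A^AGCTT).
    """
    # After fill-in, the left end keeps everything up to the bottom-strand cut
    # and the right end keeps everything from the top-strand cut onward.
    return site[: len(site) - cut] + site[cut:]


def junction_fraction(reads, junction: str) -> float:
    """Raw fraction of reads that contain the junction sequence.

    This is only an observed count. Reads cover part of each fragment and the
    junction can also occur by chance, so this is not an estimate of the true
    proportion of proximity-ligation products.
    """
    reads = list(reads)
    if not reads:
        return 0.0
    hits = sum(1 for read in reads if junction in read.upper())
    return hits / len(reads)


DPNII_JUNCTION = ligation_junction("GATC", 0)      # GATCGATC
HINDIII_JUNCTION = ligation_junction("AAGCTT", 1)  # AAGCTAGCTT

# Hypothetical toy reads, two of which span a DpnII junction.
example_reads = [
    "ACGTGATCGATCTTGA",
    "ACGTACGTACGTACGT",
    "TTGATCGATCAA",
    "GATCAAAA",
]
print(junction_fraction(example_reads, DPNII_JUNCTION))  # 0.5
```

Because the junctions of palindromic sites are their own reverse complement, the sketch searches one orientation only; a non-palindromic site would need both strands checked.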

Quality Assessment of Domesticated Animal Genome Assemblies

Bioinformatics and Biology Insights ◽

10.4137/bbi.s29333 ◽

2015 ◽

Vol 9S4 ◽

pp. BBI.S29333 ◽

Cited By ~ 3

Author(s):

Stefan E. Seemann ◽

Christian Anthon ◽

Oana Palasca ◽

Jan Gorodkin

Keyword(s):

Genomic Sequence ◽

Rna Seq ◽

Sequencing Data ◽

Assembly Quality ◽

High Quality ◽

Rnaseq Data ◽

Genome Assemblies ◽

Animal Genomes

The era of high-throughput sequencing has made it relatively simple to sequence genomes and transcriptomes of individuals from many species. In order to analyze the resulting sequencing data, high-quality reference genome assemblies are required. However, this is still a major challenge, and many domesticated animal genomes still need to be sequenced deeper in order to produce high-quality assemblies. In the meanwhile, ironically, the extent to which RNA seq and other next-generation data is produced frequently far exceeds that of the genomic sequence. Furthermore, basic comparative analysis is often affected by the lack of genomic sequence. Herein, we quantify the quality of the genome assemblies of 20 domesticated animals and related species by assessing a range of measurable parameters, and we show that there is a positive correlation between the fraction of mappable reads from RNAseq data and genome assembly quality. We rank the genomes by their assembly quality and discuss the implications for genotype analyses.

Great differences in performance and outcome of high-throughput sequencing data analysis platforms for fungal metabarcoding

MycoKeys ◽

10.3897/mycokeys.39.28109 ◽

2018 ◽

Vol 39 ◽

pp. 29-40 ◽

Cited By ~ 21

Author(s):

Sten Anslan ◽

R. Henrik Nilsson ◽

Christian Wurzbacher ◽

Petr Baldrian ◽

Leho Tedersoo ◽

...

Keyword(s):

High Throughput ◽

Computation Time ◽

Potential Effect ◽

Data Sets ◽

Sequencing Data ◽

Operational Taxonomic Units ◽

High Throughput Sequencing Data ◽

Recent Developments

Along with recent developments in high-throughput sequencing (HTS) technologies and thus fast accumulation of HTS data, there has been a growing need and interest for developing tools for HTS data processing and communication. In particular, a number of bioinformatics tools have been designed for analysing metabarcoding data, each with specific features, assumptions and outputs. To evaluate the potential effect of the application of different bioinformatics workflow on the results, we compared the performance of different analysis platforms on two contrasting high-throughput sequencing data sets. Our analysis revealed that the computation time, quality of error filtering and hence output of specific bioinformatics process largely depends on the platform used. Our results show that none of the bioinformatics workflows appears to perfectly filter out the accumulated errors and generate Operational Taxonomic Units, although PipeCraft, LotuS and PIPITS perform better than QIIME2 and Galaxy for the tested fungal amplicon dataset. We conclude that the output of each platform requires manual validation of the OTUs by examining the taxonomy assignment values.

Accurate allele frequencies from ultra-low coverage pool-seq samples in evolve-and-resequence experiments

10.1101/244004 ◽

2018 ◽

Author(s):

Susanne Tilk ◽

Alan Bergland ◽

Aaron Goodman ◽

Paul Schmidt ◽

Dmitri Petrov ◽

...

Keyword(s):

Allele Frequency ◽

Model Organism ◽

Software Tool ◽

Allele Frequencies ◽

Model Organisms ◽

Sequencing Data ◽

High Coverage ◽

Next Generation Sequencing Technology ◽

Low Coverage ◽

Pooled Samples

Abstract: Evolve-and-resequence (E+R) experiments leverage next-generation sequencing technology to track the allele frequency dynamics of populations as they evolve. While previous work has shown that adaptive alleles can be detected by comparing frequency trajectories from many replicate populations, this power comes at the expense of high-coverage (>100x) sequencing of many pooled samples, which can be cost-prohibitive. Here, we show that accurate estimates of allele frequencies can be achieved with very shallow sequencing depths (<5x) via inference of known founder haplotypes in small genomic windows. This technique can be used to efficiently estimate frequencies for any number of bi-allelic SNPs in populations of any model organism founded with sequenced homozygous strains. Using both experimentally-pooled and simulated samples of Drosophila melanogaster, we show that haplotype inference can improve allele frequency accuracy by orders of magnitude for up to 50 generations of recombination, and is robust to moderate levels of missing data, as well as different selection regimes. Finally, we show that a simple linear model generated from these simulations can predict the accuracy of haplotype-derived allele frequencies in other model organisms and experimental designs. To make these results broadly accessible for use in E+R experiments, we introduce HAF-pipe, an open-source software tool for calculating haplotype-derived allele frequencies from raw sequencing data. Ultimately, by reducing sequencing costs without sacrificing accuracy, our method facilitates E+R designs with higher replication and resolution, and thereby, increased power to detect adaptive alleles.
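The idea of a haplotype-derived allele frequency can be sketched as a weighted sum, assuming the founder haplotype frequencies within a genomic window have already been inferred (the inference step is not shown, and this is not HAF-pipe's code). The function name and the toy numbers are hypothetical. Each founder is assumed homozygous, so it carries allele 0 or 1 at every bi-allelic SNP.

```python
def haplotype_derived_af(founder_alleles, haplotype_freqs):
    """Allele frequency at each SNP implied by founder haplotype frequencies.

    founder_alleles: one row per SNP, each row holding 0/1 alleles per founder.
    haplotype_freqs: inferred frequency of each founder haplotype in the window
                     (assumed to sum to 1).
    """
    if abs(sum(haplotype_freqs) - 1.0) > 1e-6:
        raise ValueError("haplotype frequencies must sum to 1")
    for row in founder_alleles:
        if len(row) != len(haplotype_freqs):
            raise ValueError("each SNP needs one allele per founder")
    # The frequency of allele 1 is the total frequency of founders carrying it.
    return [
        sum(freq * allele for freq, allele in zip(haplotype_freqs, row))
        for row in founder_alleles
    ]


# Hypothetical window: three founders, four SNPs.
freqs = [0.5, 0.25, 0.25]
alleles = [
    [1, 0, 0],
    [0, 1, 1],
    [1, 1, 0],
    [0, 0, 0],
]
print(haplotype_derived_af(alleles, freqs))  # [0.5, 0.5, 0.75, 0.0]
```

The sketch shows why shallow coverage can suffice in this setting: every read in the window informs the few haplotype frequencies, and those in turn determine the frequency at every SNP the founders carry.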

Detection and application of genome-wide variations in peach for association and genetic relationship analysis

BMC Genetics ◽

10.1186/s12863-019-0799-8 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 2

Author(s):

Liping Guan ◽

Ke Cao ◽

Yong Li ◽

Jian Guo ◽

Qiang Xu ◽

...

Keyword(s):

Genetic Relationship ◽

Dna Markers ◽

Prunus Persica ◽

Genetic Research ◽

Diploid Species ◽

Sequencing Data ◽

Relationship Analysis ◽

Genome Wide ◽

A Genome

Abstract Background: Peach (Prunus persica L.) is a diploid species and model plant of the Rosaceae family. In the past decade, significant progress has been made in peach genetic research via DNA markers, but the number of these markers remains limited. Results: In this study, we performed a genome-wide DNA markers detection based on sequencing data of six distantly related peach accessions. A total of 650,693~1,053,547 single nucleotide polymorphisms (SNPs), 114,227~178,968 small insertion/deletions (InDels), 8,386~12,298 structure variants (SVs), 2,111~2,581 copy number variants (CNVs) and 229,357~346,940 simple sequence repeats (SSRs) were detected and annotated. To demonstrate the application of DNA markers, 944 SNPs were filtered for association study of fruit ripening time and 15 highly polymorphic SSRs were selected to analyze the genetic relationship among 221 accessions. Conclusions: The results showed that the use of high-throughput sequencing to develop DNA markers is fast and effective. Comprehensive identification of DNA markers, including SVs and SSRs, would be of benefit to genetic diversity evaluation, genetic mapping, and molecular breeding of peach.

Genome-Wide Estimation of Linkage Disequilibrium from Population-Level High-Throughput Sequencing Data

Genetics ◽

10.1534/genetics.114.165514 ◽

2014 ◽

Vol 197 (4) ◽

pp. 1303-1313 ◽

Cited By ~ 16

Author(s):

Takahiro Maruki ◽

Michael Lynch

Keyword(s):

Linkage Disequilibrium ◽

High Throughput ◽

Population Level ◽

Sequencing Data ◽

High Throughput Sequencing Data ◽

Genome Wide

Transcriptome Analysis in Domesticated Species: Challenges and Strategies

Bioinformatics and Biology Insights ◽

10.4137/bbi.s29334 ◽

2015 ◽

Vol 9S4 ◽

pp. BBI.S29334 ◽

Cited By ~ 4

Author(s):

Jessica P. Hekman ◽

Jennifer L Johnson ◽

Anna V. Kukekova

Keyword(s):

Complex Traits ◽

Gene Networks ◽

Association Studies ◽

Cultural Value ◽

Genomic Research ◽

Model Organisms ◽

Genome Wide Association Studies ◽

Rna Seq ◽

Genome Wide ◽

Domesticated species occupy a special place in the human world due to their economic and cultural value. In the era of genomic research, domesticated species provide unique advantages for investigation of diseases and complex phenotypes. RNA sequencing, or RNA-seq, has recently emerged as a new approach for studying transcriptional activity of the whole genome, changing the focus from individual genes to gene networks. RNA-seq analysis in domesticated species may complement genome-wide association studies of complex traits with economic importance or direct relevance to biomedical research. However, RNA-seq studies are more challenging in domesticated species than in model organisms. These challenges are at least in part associated with the lack of quality genome assemblies for some domesticated species and the absence of genome assemblies for others. In this review, we discuss strategies for analyzing RNA-seq data, focusing particularly on questions and examples relevant to domesticated species.

The theory and practice of measuring broad-range recombination rate from marker selected pools

10.1101/762575 ◽

2019 ◽

Author(s):

Kevin H.-C. Wei ◽

Aditya Mantha ◽

Doris Bachtrog

Keyword(s):

Genetic Distance ◽

Allele Frequency ◽

Recombination Rate ◽

Genetic Material ◽

Cost Effective ◽

Theory And Practice ◽

Rate Variation ◽

Sequencing Data ◽

Genome Wide

Abstract: Recombination is the exchange of genetic material between homologous chromosomes via physical crossovers. Pioneered by T. H. Morgan and A. Sturtevant over a century ago, methods to estimate recombination rate and genetic distance require scoring large numbers of recombinant individuals between molecular or visible markers. While high throughput sequencing methods have allowed for genome wide crossover detection producing high resolution maps, such methods rely on large numbers of individually sequenced recombinants and are therefore difficult to scale. Here, we present a simple and scalable method to infer near chromosome-wide recombination rate from marker selected pools and the corresponding analytical software MarSuPial. Rather than genotyping individuals from recombinant backcrosses, we bulk sequence marker selected pools to infer the allele frequency decay around the selected locus; since the number of recombinant individuals increases proportionally to the genetic distance from the selected locus, the allele frequency across the chromosome can be used to estimate the genetic distance and recombination rate. We mathematically demonstrate the relationship between allele frequency attenuation, recombinant fraction, genetic distance, and recombination rate in marker selected pools. Based on available chromosome-wide recombination rate models of Drosophila, we simulated read counts and determined that nonlinear local regressions (LOESS) produce robust estimates despite the high noise inherent to sequencing data. To empirically validate this approach, we show that (single) marker selected pools closely recapitulate genetic distances inferred from scoring recombinants between double markers. We theoretically determine how secondary loci with viability impacts can modulate the allele frequency decay and how to account for such effects directly from the data. We generated the recombinant maps of three wild derived strains, which strongly correlate with previous genome-wide measurements. Interestingly, amidst extensive recombination rate variation, multiple regions of the genomes show elevated rates across all strains. Lastly, we apply this method to estimate chromosome-wide crossover interference. Altogether, we find that sequencing marker selected pools is a simple and cost effective method for broad recombination rate estimates. Although it does not identify instances of crossovers, it can generate near chromosome-wide recombination maps in as little as one or two libraries.

Detection and application of genome-wide variations in peach for association and genetic relationship analysis

10.21203/rs.2.10634/v3 ◽

2019 ◽

Author(s):

Liping Guan ◽

Ke Cao ◽

Yong Li ◽

Jian Guo ◽

Qiang Xu ◽

...

Keyword(s):

Genetic Relationship ◽

Dna Markers ◽

Prunus Persica ◽

Genetic Research ◽

Diploid Species ◽

Sequencing Data ◽

Relationship Analysis ◽

Genome Wide ◽

A Genome

Abstract Background: Peach (Prunus persica L.) is a diploid species and model plant of the Rosaceae family. In the past decade, significant progress has been made in peach genetic research via DNA markers, but the number of these markers remains limited. Results: In this study, we performed a genome-wide DNA markers detection based on sequencing data of six distantly related peach accessions. A total of 650,693~1,053,547 single nucleotide polymorphisms (SNPs), 114,227~178,968 small insertion/deletions (InDels), 8,386~12,298 structure variants (SVs), 2,111~2,581 copy number variants (CNVs) and 229,357~346,940 simple sequence repeats (SSRs) were detected and annotated. To demonstrate the application of DNA markers, 944 SNPs were filtered for association study of fruit ripening time and 15 highly polymorphic SSRs were selected to analyze the genetic relationship among 221 accessions. Conclusions: The results showed that the use of high-throughput sequencing to develop DNA markers is fast and effective. Comprehensive identification of DNA markers, including SVs and SSRs, would be of benefit to genetic diversity evaluation, genetic mapping, and molecular breeding of peach.

Construction of whole genomes from scaffolds using single cell strand-seq data

10.1101/271510 ◽

2018 ◽

Cited By ~ 4

Author(s):

Mark Hills ◽

Ester Falconer ◽

Kieran O’Neil ◽

Ashley D. Sanders ◽

Kerstin Howe ◽

...

Keyword(s):

Single Cell ◽

Sequence Data ◽

Model Organisms ◽

Tasmanian Devil ◽

Template Strand ◽

Modern Molecular Biology ◽

Whole Genomes ◽

Dna Strand ◽

Accurate reference genome sequences provide the foundation for modern molecular biology and genomics as the interpretation of sequence data to study evolution, gene expression and epigenetics depends heavily on the quality of the genome assembly used for its alignment. Correctly organising sequenced fragments such as contigs and scaffolds in relation to each other is a critical and often challenging step in the construction of robust genome references. We previously identified misoriented regions in the mouse and human reference assemblies using Strand-seq, a single cell sequencing technique that preserves DNA directionality [1, 2]. Here we demonstrate the ability of Strand-seq to build and correct full-length chromosomes, by identifying which scaffolds belong to the same chromosome and determining their correct order and orientation, without the need for overlapping sequences. We demonstrate that Strand-seq exquisitely maps assembly fragments into large related groups and chromosome-sized clusters without using new assembly data. Using template strand inheritance as a bi-allelic marker, we employ genetic mapping principles to cluster scaffolds that are derived from the same chromosome and order them within the chromosome based solely on directionality of DNA strand inheritance. We prove the utility of our approach by generating improved genome assemblies for several model organisms including the ferret, pig, Xenopus, zebrafish, Tasmanian devil and the Guinea pig.