qc3C: reference-free quality control for Hi-C sequencing data

2021 ◽  
Author(s):  
Matthew Z. DeMaere ◽  
Aaron E. Darling

Abstract: Hi-C is a sample preparation method that enables high-throughput sequencing to capture genome-wide spatial interactions between DNA molecules. The technique has been successfully applied to solve challenging problems such as 3D structural analysis of chromatin, scaffolding of large genome assemblies and more recently the accurate resolution of metagenome-assembled genomes (MAGs). Despite continued refinements, however, Hi-C library preparation remains a complex laboratory protocol and diligent quality management is recommended to avoid costly failure. Current wet-lab protocols for Hi-C library QC provide only a crude assay, while commonly used sequence-based QC methods demand a reference genome, the quality of which can skew results. We propose a new, reference-free approach for Hi-C library quality assessment that requires only a modest amount of sequencing data. The algorithm builds upon the observation that proximity ligation events are likely to create k-mers that would not naturally occur in the sample. Our software tool (qc3C) is to our knowledge the first to implement a reference-free Hi-C QC tool, and also provides reference-based QC, enabling Hi-C to be more easily applied to non-model organisms and environmental samples. We characterise the accuracy of the new algorithm on simulated and real datasets and compare it to reference-based methods.

2021 ◽  
Vol 17 (10) ◽  
pp. e1008839
Author(s):  
Matthew Z. DeMaere ◽  
Aaron E. Darling

Hi-C is a sample preparation method that enables high-throughput sequencing to capture genome-wide spatial interactions between DNA molecules. The technique has been successfully applied to solve challenging problems such as 3D structural analysis of chromatin, scaffolding of large genome assemblies and more recently the accurate resolution of metagenome-assembled genomes (MAGs). Despite continued refinements, however, preparing a Hi-C library remains a complex laboratory protocol. To avoid costly failures and maximise the odds of successful outcomes, diligent quality management is recommended. Current wet-lab methods provide only a crude assay of Hi-C library quality, while key post-sequencing quality indicators used have—thus far—relied upon reference-based read-mapping. When a reference is accessible, this reliance introduces a concern for quality, where an incomplete or inexact reference skews the resulting quality indicators. We propose a new, reference-free approach that infers the total fraction of read-pairs that are a product of proximity ligation. This quantification of Hi-C library quality requires only a modest amount of sequencing data and is independent of other application-specific criteria. The algorithm builds upon the observation that proximity ligation events are likely to create k-mers that would not naturally occur in the sample. Our software tool (qc3C) is to our knowledge the first to implement a reference-free Hi-C QC tool, and also provides reference-based QC, enabling Hi-C to be more easily applied to non-model organisms and environmental samples. We characterise the accuracy of the new algorithm on simulated and real datasets and compare it to reference-based methods.
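The k-mer observation in the abstract above can be illustrated with a toy sketch. This is not qc3C's algorithm or interface. It assumes a DpnII-style digest (cut site GATC, which yields the junction GATCGATC after end fill-in and ligation) and a pre-computed background k-mer count table; the names `junction_kmers`, `fraction_novel` and `min_count` are hypothetical.

```python
from collections import Counter

# Assumption: junction produced by a DpnII (^GATC) digest, fill-in and ligation.
JUNCTION = "GATCGATC"


def kmers(seq, k):
    """Yield every k-mer of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def junction_kmers(read, k, junction=JUNCTION):
    """Return the k-mers of `read` that fully contain the junction sequence."""
    return [km for km in kmers(read, k) if junction in km]


def fraction_novel(reads, background, k, min_count=2):
    """Fraction of junction-spanning k-mers seen fewer than `min_count`
    times in the background count table (a Counter of k-mers)."""
    spanning = [km for read in reads for km in junction_kmers(read, k)]
    if not spanning:
        return 0.0
    novel = sum(1 for km in spanning if background[km] < min_count)
    return novel / len(spanning)


if __name__ == "__main__":
    genome = "ACGTTGCAAGATCTTAGGCCATGATCCGTA"      # toy "natural" sequence
    background = Counter(kmers(genome, 12))
    chimeric_read = "TTGCAAGATCGATCCGTA"           # two distant fragments joined
    print(fraction_novel([chimeric_read], background, 12, min_count=1))
```

In this toy case every k-mer spanning the junction is absent from the background, which is the signal the abstract describes; a real tool would also need to handle sequencing errors and junction sequences that occur naturally.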


2015 ◽  
Vol 9S4 ◽  
pp. BBI.S29333 ◽  
Author(s):  
Stefan E. Seemann ◽  
Christian Anthon ◽  
Oana Palasca ◽  
Jan Gorodkin

The era of high-throughput sequencing has made it relatively simple to sequence genomes and transcriptomes of individuals from many species. In order to analyze the resulting sequencing data, high-quality reference genome assemblies are required. However, this is still a major challenge, and many domesticated animal genomes still need to be sequenced deeper in order to produce high-quality assemblies. In the meantime, ironically, the extent to which RNA-seq and other next-generation data is produced frequently far exceeds that of the genomic sequence. Furthermore, basic comparative analysis is often affected by the lack of genomic sequence. Herein, we quantify the quality of the genome assemblies of 20 domesticated animals and related species by assessing a range of measurable parameters, and we show that there is a positive correlation between the fraction of mappable reads from RNA-seq data and genome assembly quality. We rank the genomes by their assembly quality and discuss the implications for genotype analyses.


MycoKeys ◽  
2018 ◽  
Vol 39 ◽  
pp. 29-40 ◽  
Author(s):  
Sten Anslan ◽  
R. Henrik Nilsson ◽  
Christian Wurzbacher ◽  
Petr Baldrian ◽  
Leho Tedersoo ◽  
...  

Along with recent developments in high-throughput sequencing (HTS) technologies and thus fast accumulation of HTS data, there has been a growing need and interest for developing tools for HTS data processing and communication. In particular, a number of bioinformatics tools have been designed for analysing metabarcoding data, each with specific features, assumptions and outputs. To evaluate the potential effect of the application of different bioinformatics workflows on the results, we compared the performance of different analysis platforms on two contrasting high-throughput sequencing data sets. Our analysis revealed that the computation time, the quality of error filtering and hence the output of a given bioinformatics process largely depend on the platform used. Our results show that none of the bioinformatics workflows appears to perfectly filter out the accumulated errors and generate Operational Taxonomic Units, although PipeCraft, LotuS and PIPITS perform better than QIIME2 and Galaxy for the tested fungal amplicon dataset. We conclude that the output of each platform requires manual validation of the OTUs by examining the taxonomy assignment values.


2018 ◽  
Author(s):  
Susanne Tilk ◽  
Alan Bergland ◽  
Aaron Goodman ◽  
Paul Schmidt ◽  
Dmitri Petrov ◽  
...  

Abstract: Evolve-and-resequence (E+R) experiments leverage next-generation sequencing technology to track the allele frequency dynamics of populations as they evolve. While previous work has shown that adaptive alleles can be detected by comparing frequency trajectories from many replicate populations, this power comes at the expense of high-coverage (>100x) sequencing of many pooled samples, which can be cost-prohibitive. Here, we show that accurate estimates of allele frequencies can be achieved with very shallow sequencing depths (<5x) via inference of known founder haplotypes in small genomic windows. This technique can be used to efficiently estimate frequencies for any number of bi-allelic SNPs in populations of any model organism founded with sequenced homozygous strains. Using both experimentally-pooled and simulated samples of Drosophila melanogaster, we show that haplotype inference can improve allele frequency accuracy by orders of magnitude for up to 50 generations of recombination, and is robust to moderate levels of missing data, as well as different selection regimes. Finally, we show that a simple linear model generated from these simulations can predict the accuracy of haplotype-derived allele frequencies in other model organisms and experimental designs. To make these results broadly accessible for use in E+R experiments, we introduce HAF-pipe, an open-source software tool for calculating haplotype-derived allele frequencies from raw sequencing data. Ultimately, by reducing sequencing costs without sacrificing accuracy, our method facilitates E+R designs with higher replication and resolution, and thereby, increased power to detect adaptive alleles.
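The idea of a haplotype-derived allele frequency can be sketched as a weighted sum. This is an illustration only, not HAF-pipe's implementation: it assumes the founder haplotype frequencies for one genomic window have already been inferred, and that each homozygous founder is coded 0 (reference) or 1 (alternate) at every bi-allelic SNP. The function name `haplotype_derived_af` is hypothetical.

```python
def haplotype_derived_af(hap_freqs, founder_genotypes):
    """Alternate-allele frequency at each SNP in a window, computed as the
    sum of founder haplotype frequencies weighted by founder genotype.

    hap_freqs: one inferred frequency per founder (assumed to sum to 1).
    founder_genotypes: one list per founder with 0 or 1 for each SNP.
    """
    if len(hap_freqs) != len(founder_genotypes):
        raise ValueError("need exactly one frequency per founder")
    if abs(sum(hap_freqs) - 1.0) > 1e-6:
        raise ValueError("haplotype frequencies must sum to 1")
    n_snps = len(founder_genotypes[0])
    return [
        sum(f * g[s] for f, g in zip(hap_freqs, founder_genotypes))
        for s in range(n_snps)
    ]


if __name__ == "__main__":
    freqs = [0.5, 0.25, 0.25]               # three founders in one window
    genotypes = [[1, 0], [0, 0], [1, 1]]    # two SNPs per founder
    print(haplotype_derived_af(freqs, genotypes))
```

Because every SNP in the window shares the same few haplotype frequencies, reads from the whole window inform each SNP, which is one way to see why shallow coverage can suffice under these assumptions.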


BMC Genetics ◽  
2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Liping Guan ◽  
Ke Cao ◽  
Yong Li ◽  
Jian Guo ◽  
Qiang Xu ◽  
...  

Abstract Background: Peach (Prunus persica L.) is a diploid species and model plant of the Rosaceae family. In the past decade, significant progress has been made in peach genetic research via DNA markers, but the number of these markers remains limited. Results: In this study, we performed a genome-wide DNA markers detection based on sequencing data of six distantly related peach accessions. A total of 650,693~1,053,547 single nucleotide polymorphisms (SNPs), 114,227~178,968 small insertion/deletions (InDels), 8,386~12,298 structure variants (SVs), 2,111~2,581 copy number variants (CNVs) and 229,357~346,940 simple sequence repeats (SSRs) were detected and annotated. To demonstrate the application of DNA markers, 944 SNPs were filtered for association study of fruit ripening time and 15 highly polymorphic SSRs were selected to analyze the genetic relationship among 221 accessions. Conclusions: The results showed that the use of high-throughput sequencing to develop DNA markers is fast and effective. Comprehensive identification of DNA markers, including SVs and SSRs, would be of benefit to genetic diversity evaluation, genetic mapping, and molecular breeding of peach.


2015 ◽  
Vol 9S4 ◽  
pp. BBI.S29334 ◽  
Author(s):  
Jessica P. Hekman ◽  
Jennifer L Johnson ◽  
Anna V. Kukekova

Domesticated species occupy a special place in the human world due to their economic and cultural value. In the era of genomic research, domesticated species provide unique advantages for investigation of diseases and complex phenotypes. RNA sequencing, or RNA-seq, has recently emerged as a new approach for studying transcriptional activity of the whole genome, changing the focus from individual genes to gene networks. RNA-seq analysis in domesticated species may complement genome-wide association studies of complex traits with economic importance or direct relevance to biomedical research. However, RNA-seq studies are more challenging in domesticated species than in model organisms. These challenges are at least in part associated with the lack of quality genome assemblies for some domesticated species and the absence of genome assemblies for others. In this review, we discuss strategies for analyzing RNA-seq data, focusing particularly on questions and examples relevant to domesticated species.


2019 ◽  
Author(s):  
Kevin H.-C. Wei ◽  
Aditya Mantha ◽  
Doris Bachtrog

Abstract: Recombination is the exchange of genetic material between homologous chromosomes via physical crossovers. Pioneered by T. H. Morgan and A. Sturtevant over a century ago, methods to estimate recombination rate and genetic distance require scoring large numbers of recombinant individuals between molecular or visible markers. While high throughput sequencing methods have allowed for genome wide crossover detection producing high resolution maps, such methods rely on large numbers of individually sequenced recombinants and are therefore difficult to scale. Here, we present a simple and scalable method to infer near chromosome-wide recombination rate from marker selected pools and the corresponding analytical software MarSuPial. Rather than genotyping individuals from recombinant backcrosses, we bulk sequence marker selected pools to infer the allele frequency decay around the selected locus; since the number of recombinant individuals increases proportionally to the genetic distance from the selected locus, the allele frequency across the chromosome can be used to estimate the genetic distance and recombination rate. We mathematically demonstrate the relationship between allele frequency attenuation, recombinant fraction, genetic distance, and recombination rate in marker selected pools. Based on available chromosome-wide recombination rate models of Drosophila, we simulated read counts and determined that nonlinear local regressions (LOESS) produce robust estimates despite the high noise inherent to sequencing data. To empirically validate this approach, we show that (single) marker selected pools closely recapitulate genetic distances inferred from scoring recombinants between double markers. We theoretically determine how secondary loci with viability impacts can modulate the allele frequency decay and how to account for such effects directly from the data.
We generated the recombinant map of three wild derived strains which strongly correlates with previous genome-wide measurements. Interestingly, amidst extensive recombination rate variation, multiple regions of the genomes show elevated rates across all strains. Lastly, we apply this method to estimate chromosome-wide crossover interference. Altogether, we find that marker selected pools is a simple and cost effective method for broad recombination rate estimates. Although it does not identify instances of crossovers, it can generate near chromosome-wide recombination maps in as little as one or two libraries.
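The link between recombinant fraction, genetic distance and recombination rate can be sketched with a standard map function. This is not MarSuPial's method: it substitutes Haldane's map function, which assumes no crossover interference, and it takes recombinant fractions as given, because converting pooled allele frequencies to recombinant fractions depends on the cross design. The names `haldane_cM` and `recombination_rate` are hypothetical.

```python
import math


def haldane_cM(r):
    """Genetic distance in centimorgans from recombinant fraction r,
    using Haldane's map function (assumes no crossover interference)."""
    if not 0.0 <= r < 0.5:
        raise ValueError("recombinant fraction must be in [0, 0.5)")
    return -50.0 * math.log(1.0 - 2.0 * r)


def recombination_rate(positions_mb, distances_cM):
    """Piecewise recombination rate (cM/Mb) between consecutive markers,
    as the change in genetic distance over the change in physical position."""
    points = list(zip(positions_mb, distances_cM))
    return [
        (d1 - d0) / (p1 - p0)
        for (p0, d0), (p1, d1) in zip(points, points[1:])
    ]


if __name__ == "__main__":
    fractions = [0.0, 0.05, 0.10]           # recombinant fraction per marker
    positions = [0.0, 2.0, 5.0]             # marker positions in Mb
    distances = [haldane_cM(r) for r in fractions]
    print(distances)
    print(recombination_rate(positions, distances))
```

With noisy sequencing data the distances would normally be smoothed (the abstract mentions LOESS) before differencing, since finite differences amplify noise.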


2019 ◽  
Author(s):  
Liping Guan ◽  
Ke Cao ◽  
Yong Li ◽  
Jian Guo ◽  
Qiang Xu ◽  
...  

Abstract Background: Peach (Prunus persica L.) is a diploid species and model plant of the Rosaceae family. In the past decade, significant progress has been made in peach genetic research via DNA markers, but the number of these markers remains limited. Results: In this study, we performed a genome-wide DNA markers detection based on sequencing data of six distantly related peach accessions. A total of 650,693~1,053,547 single nucleotide polymorphisms (SNPs), 114,227~178,968 small insertion/deletions (InDels), 8,386~12,298 structure variants (SVs), 2,111~2,581 copy number variants (CNVs) and 229,357~346,940 simple sequence repeats (SSRs) were detected and annotated. To demonstrate the application of DNA markers, 944 SNPs were filtered for association study of fruit ripening time and 15 highly polymorphic SSRs were selected to analyze the genetic relationship among 221 accessions. Conclusions: The results showed that the use of high-throughput sequencing to develop DNA markers is fast and effective. Comprehensive identification of DNA markers, including SVs and SSRs, would be of benefit to genetic diversity evaluation, genetic mapping, and molecular breeding of peach.


2018 ◽  
Author(s):  
Mark Hills ◽  
Ester Falconer ◽  
Kieran O’Neil ◽  
Ashley D. Sanders ◽  
Kerstin Howe ◽  
...  

Accurate reference genome sequences provide the foundation for modern molecular biology and genomics as the interpretation of sequence data to study evolution, gene expression and epigenetics depends heavily on the quality of the genome assembly used for its alignment. Correctly organising sequenced fragments such as contigs and scaffolds in relation to each other is a critical and often challenging step in the construction of robust genome references. We previously identified misoriented regions in the mouse and human reference assemblies using Strand-seq, a single cell sequencing technique that preserves DNA directionality [1, 2]. Here we demonstrate the ability of Strand-seq to build and correct full-length chromosomes, by identifying which scaffolds belong to the same chromosome and determining their correct order and orientation, without the need for overlapping sequences. We demonstrate that Strand-seq exquisitely maps assembly fragments into large related groups and chromosome-sized clusters without using new assembly data. Using template strand inheritance as a bi-allelic marker, we employ genetic mapping principles to cluster scaffolds that are derived from the same chromosome and order them within the chromosome based solely on directionality of DNA strand inheritance. We prove the utility of our approach by generating improved genome assemblies for several model organisms including the ferret, pig, Xenopus, zebrafish, Tasmanian devil and the guinea pig.

