Read coverage as an indicator of misassembly in a short-read based genome assembly

2019 ◽  
Author(s):  
Peipei Wang ◽  
Fanrui Meng ◽  
Bethany M. Moore ◽  
Shin-Han Shiu

ABSTRACT Availability of genome sequences has led to significant advances in biology. With few exceptions, the great majority of existing genome assemblies are derived from short read sequencing technologies with highly uneven read coverages indicative of sequencing and assembly issues. In tomato, 0.6% (5.1 Mb) and 9.7% (79.6 Mb) of short-read based assembly had significantly higher and lower coverage compared to background, respectively. We established machine learning models capable of predicting genomic regions with variable coverages and found that high coverage regions tend to have lower simple sequence repeat but higher tandem gene densities compared to background regions. To determine if the high coverage regions were misassembled, we examined a recently available long-read based assembly and found that 27.8% (1.41 Mb) of high coverage regions were potentially misassembled of duplicate sequences, compared to 1.4% in background regions. In addition, using a machine learning model that can distinguish correctly and incorrectly assembled high coverage regions, we found that misassembled, high coverage regions tend to be flanked by simple sequence repeats, pseudogenes, and transposon elements. Our study provides insights on the causes of variable coverage regions and a quantitative assessment of factors contributing to misassembly when using short reads.

2021 ◽  
Author(s):  
Peipei Wang ◽  
Fanrui Meng ◽  
Bethany M. Moore ◽  
Shin-Han Shiu

Abstract Background: Availability of plant genome sequences has led to significant advances. However, with few exceptions, the great majority of existing genome assemblies are derived from short read sequencing technologies with highly uneven read coverages indicative of sequencing and assembly issues that could significantly impact any downstream analysis of plant genomes. In tomato for example, 0.6% (5.1 Mb) and 9.7% (79.6 Mb) of short-read based assembly had significantly higher and lower coverage compared to background, respectively. Results: To understand what the causes may be for such uneven coverage, we first established machine learning models capable of predicting genomic regions with variable coverages and found that high coverage regions tend to have higher simple sequence repeat and tandem gene densities compared to background regions. To determine if the high coverage regions were misassembled, we examined a recently available tomato long-read based assembly and found that 27.8% (1.41 Mb) of high coverage regions were potentially misassembled of duplicate sequences, compared to 1.4% in background regions. In addition, using a predictive model that can distinguish correctly and incorrectly assembled high coverage regions, we found that misassembled, high coverage regions tend to be flanked by simple sequence repeats, pseudogenes, and transposon elements. Conclusions: Our study provides insights on the causes of variable coverage regions and a quantitative assessment of factors contributing to plant genome misassembly when using short reads.


2020 ◽  
Author(s):  
Peipei Wang ◽  
Fanrui Meng ◽  
Bethany M. Moore ◽  
Shin-Han Shiu

Abstract Background Availability of plant genome sequences has led to significant advances. However, with few exceptions, the great majority of existing genome assemblies are derived from short read sequencing technologies with highly uneven read coverages indicative of sequencing and assembly issues that could significantly impact any downstream analysis of plant genomes. In tomato for example, 0.6% (5.1 Mb) and 9.7% (79.6 Mb) of short-read based assembly had significantly higher and lower coverage compared to background, respectively. Results To understand what the causes may be for such uneven coverage, we first established machine learning models capable of predicting genomic regions with variable coverages and found that high coverage regions tend to have higher simple sequence repeat and tandem gene densities compared to background regions. To determine if the high coverage regions were misassembled, we examined a recently available tomato long-read based assembly and found that 27.8% (1.41 Mb) of high coverage regions were potentially misassembled of duplicate sequences, compared to 1.4% in background regions. In addition, using a predictive model that can distinguish correctly and incorrectly assembled high coverage regions, we found that misassembled, high coverage regions tend to be flanked by simple sequence repeats, pseudogenes, and transposon elements. Conclusions Our study provides insights on the causes of variable coverage regions and a quantitative assessment of factors contributing to plant genome misassembly when using short reads.


2021 ◽  
Author(s):  
Peipei Wang ◽  
Fanrui Meng ◽  
Bethany M. Moore ◽  
Shin-Han Shiu

Abstract Background: Availability of plant genome sequences has led to significant advances. However, with few exceptions, the great majority of existing genome assemblies are derived from short read sequencing technologies with highly uneven read coverages indicative of sequencing and assembly issues that could significantly impact any downstream analysis of plant genomes. In tomato for example, 0.6% (5.1 Mb) and 9.7% (79.6 Mb) of short-read based assembly had significantly higher and lower coverage compared to background, respectively. Results: To understand what the causes may be for such uneven coverage, we first established machine learning models capable of predicting genomic regions with variable coverages and found that high coverage regions tend to have higher simple sequence repeat and tandem gene densities compared to background regions. To determine if the high coverage regions were misassembled, we examined a recently available tomato long-read based assembly and found that 27.8% (1.41 Mb) of high coverage regions were potentially misassembled of duplicate sequences, compared to 1.4% in background regions. In addition, using a predictive model that can distinguish correctly and incorrectly assembled high coverage regions, we found that misassembled, high coverage regions tend to be flanked by simple sequence repeats, pseudogenes, and transposon elements. Conclusions: Our study provides insights on the causes of variable coverage regions and a quantitative assessment of factors contributing to plant genome misassembly when using short reads, and the generality of these causes and factors should be tested further in other species.


BMC Genomics ◽  
2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Peipei Wang ◽  
Fanrui Meng ◽  
Bethany M. Moore ◽  
Shin-Han Shiu

Abstract Background Availability of plant genome sequences has led to significant advances. However, with few exceptions, the great majority of existing genome assemblies are derived from short read sequencing technologies with highly uneven read coverages indicative of sequencing and assembly issues that could significantly impact any downstream analysis of plant genomes. In tomato for example, 0.6% (5.1 Mb) and 9.7% (79.6 Mb) of short-read based assembly had significantly higher and lower coverage compared to background, respectively. Results To understand what the causes may be for such uneven coverage, we first established machine learning models capable of predicting genomic regions with variable coverages and found that high coverage regions tend to have higher simple sequence repeat and tandem gene densities compared to background regions. To determine if the high coverage regions were misassembled, we examined a recently available tomato long-read based assembly and found that 27.8% (1.41 Mb) of high coverage regions were potentially misassembled of duplicate sequences, compared to 1.4% in background regions. In addition, using a predictive model that can distinguish correctly and incorrectly assembled high coverage regions, we found that misassembled, high coverage regions tend to be flanked by simple sequence repeats, pseudogenes, and transposon elements. Conclusions Our study provides insights on the causes of variable coverage regions and a quantitative assessment of factors contributing to plant genome misassembly when using short reads and the generality of these causes and factors should be tested further in other species.
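As a rough illustration of the idea of flagging regions whose read coverage departs from background, the sketch below labels fixed-size windows as high, low, or background. The abstract does not state how significance was determined, so the rule used here (median plus or minus k times the median absolute deviation) and the names `classify_windows` and `k` are assumptions made for illustration only, not the authors' procedure.

```python
from statistics import median


def classify_windows(coverage, k=3.0):
    """Label each window's mean coverage as 'high', 'low', or 'background'.

    Assumption: a window is flagged when it lies more than k median
    absolute deviations (MAD) from the median of all windows. This is a
    stand-in rule, not the test used in the paper.
    """
    med = median(coverage)
    mad = median(abs(c - med) for c in coverage)
    # Avoid a zero-width band when most windows have identical coverage.
    spread = mad if mad > 0 else 1.0

    labels = []
    for c in coverage:
        if c > med + k * spread:
            labels.append("high")
        elif c < med - k * spread:
            labels.append("low")
        else:
            labels.append("background")
    return labels


# Hypothetical per-window mean coverage values.
example = [30, 31, 29, 30, 32, 28, 30, 95, 2, 31]
labels = classify_windows(example)
```

In this toy input the window with coverage 95 is labeled high and the window with coverage 2 is labeled low; everything else falls within the background band.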


2020 ◽  
Author(s):  
Lauren Coombe ◽  
Vladimir Nikolić ◽  
Justin Chu ◽  
Inanc Birol ◽  
René L. Warren

Abstract Summary The ability to generate high-quality genome sequences is cornerstone to modern biological research. Even with recent advancements in sequencing technologies, many genome assemblies are still not achieving reference-grade. Here, we introduce ntJoin, a tool that leverages structural synteny between a draft assembly and reference sequence(s) to contiguate and correct the former with respect to the latter. Instead of alignments, ntJoin uses a lightweight mapping approach based on a graph data structure generated from ordered minimizer sketches. The tool can be used in a variety of different applications, including improving a draft assembly with a reference-grade genome, a short read assembly with a draft long read assembly, and a draft assembly with an assembly from a closely-related species. When scaffolding a human short read assembly using the reference human genome or a long read assembly, ntJoin improves the NGA50 length 23- and 13-fold, respectively, in under 13 m, using less than 11 GB of RAM. Compared to existing reference-guided assemblers, ntJoin generates highly contiguous assemblies faster and using less memory. Availability and implementation ntJoin is written in C++ and Python, and is freely available at https://github.com/bcgsc/ntjoin.


Author(s):  
Nathan D. Olson ◽  
Justin Wagner ◽  
Jennifer McDaniel ◽  
Sarah H. Stephens ◽  
Samuel T. Westreich ◽  
...  

Summary The precisionFDA Truth Challenge V2 aimed to assess the state-of-the-art of variant calling in difficult-to-map regions and the Major Histocompatibility Complex (MHC). Starting with FASTQ files, 20 challenge participants applied their variant calling pipelines and submitted 64 variant callsets for one or more sequencing technologies (~35X Illumina, ~35X PacBio HiFi, and ~50X Oxford Nanopore Technologies). Submissions were evaluated following best practices for benchmarking small variants with the new GIAB benchmark sets and genome stratifications. Challenge submissions included a number of innovative methods for all three technologies, with graph-based and machine-learning methods scoring best for short-read and long-read datasets, respectively. New methods out-performed the 2016 Truth Challenge winners, and new machine-learning approaches combining multiple sequencing technologies performed particularly well. Recent developments in sequencing and variant calling have enabled benchmarking variants in challenging genomic regions, paving the way for the identification of previously unknown clinically relevant variants.


2020 ◽  
Vol 36 (12) ◽  
pp. 3885-3887 ◽  
Author(s):  
Lauren Coombe ◽  
Vladimir Nikolić ◽  
Justin Chu ◽  
Inanc Birol ◽  
René L Warren

Abstract Summary The ability to generate high-quality genome sequences is cornerstone to modern biological research. Even with recent advancements in sequencing technologies, many genome assemblies are still not achieving reference-grade. Here, we introduce ntJoin, a tool that leverages structural synteny between a draft assembly and reference sequence(s) to contiguate and correct the former with respect to the latter. Instead of alignments, ntJoin uses a lightweight mapping approach based on a graph data structure generated from ordered minimizer sketches. The tool can be used in a variety of different applications, including improving a draft assembly with a reference-grade genome, a short-read assembly with a draft long-read assembly and a draft assembly with an assembly from a closely related species. When scaffolding a human short-read assembly using the reference human genome or a long-read assembly, ntJoin improves the NGA50 length 23- and 13-fold, respectively, in under 13 m, using <11 GB of RAM. Compared to existing reference-guided scaffolders, ntJoin generates highly contiguous assemblies faster and using less memory. Availability and implementation ntJoin is written in C++ and Python and is freely available at https://github.com/bcgsc/ntjoin. Supplementary information Supplementary data are available at Bioinformatics online.
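To make the term "ordered minimizer sketch" concrete, here is a minimal sketch of how such a list can be computed for one sequence. It is illustrative only: the function name `minimizer_sketch` and the parameters `k` and `w` are hypothetical, and k-mers are ranked lexicographically for readability, whereas practical tools typically rank k-mers by a hash value. It does not reproduce ntJoin's graph construction.

```python
def minimizer_sketch(seq, k, w):
    """Return the ordered list of (position, k-mer) minimizers of seq.

    For every window of w consecutive k-mers, the smallest k-mer is
    selected (leftmost on ties). Assumption: lexicographic order stands
    in for the hash-based ordering used by real tools.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    sketch = []
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # min() returns the first index among equal smallest k-mers.
        offset = min(range(w), key=lambda j: window[j])
        pos = start + offset
        # Adjacent windows often share a minimizer; record it only once.
        if not sketch or sketch[-1][0] != pos:
            sketch.append((pos, kmers[pos]))
    return sketch


sketch = minimizer_sketch("ACGTACGGT", k=3, w=3)
```

Because the selected positions never decrease as the window slides, comparing against the last recorded position is enough to avoid duplicates. Two sequences that share long stretches will tend to share runs of minimizers in the same order, which is what makes the sketch usable as a lightweight substitute for alignment.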


2022 ◽  
Author(s):  
Valentina Peona ◽  
Mozes Blom ◽  
Carolina Frankl-Vilches ◽  
Borja Milá ◽  
Hidayat Ashari ◽  
...  

Structural variants (SVs) are DNA mutations that can have relevant effects at micro- and macro-evolutionary scales. The detection of SVs is largely limited by the type and quality of sequencing technologies adopted, therefore genetic variability linked to SVs may remain undiscovered, especially in complex repetitive genomic regions. In this study, we used a combination of long-read and linked-read genome assemblies to investigate the occurrence of insertions and deletions across the chromosomes of 14 species of birds-of-paradise and two species of estrildid finches including highly repetitive W chromosomes. The species sampling encompasses most genera and representatives from all major clades of birds-of-paradise, allowing comparisons between individuals of the same species, genus, and family. We found the highest densities of SVs to be located on the microchromosomes and on the female-specific W chromosome. Genome assemblies of multiple individuals from the same species allowed us to compare the levels of genetic variability linked to SVs and single nucleotide polymorphisms (SNPs) on the W and other chromosomes. Our results demonstrate that the avian W chromosome harbours more genetic variability than previously thought and that its structure is shaped by the continuous accumulation and turn-over of transposable element insertions, especially endogenous retroviruses.


2016 ◽  
Author(s):  
Minh Duc Cao ◽  
Son Hoang Nguyen ◽  
Devika Ganesamoorthy ◽  
Alysha G. Elliott ◽  
Matthew Cooper ◽  
...  

Abstract Genome assemblies obtained from short read sequencing technologies are often fragmented into many contigs because of the abundance of repetitive sequences. Long read sequencing technologies allow the generation of reads spanning most repeat sequences, providing the opportunity to complete these genome assemblies. However, substantial amounts of sequence data and computational resources are required to overcome the high per-base error rate inherent to these technologies. Furthermore, most existing methods only assemble the genomes after sequencing has completed which could result in either generation of more sequence data at greater cost than required or a low-quality assembly if insufficient data are generated. Here we present the first computational method which utilises real-time nanopore sequencing to scaffold and complete short-read assemblies while the long read sequence data is being generated. The method reports the progress of completing the assembly in real-time so users can terminate the sequencing once an assembly of sufficient quality and completeness is obtained. We use our method to complete four bacterial genomes and one eukaryotic genome, and show that it is able to construct more complete and more accurate assemblies, and at the same time, requires less sequencing data and computational resources than existing pipelines. We also demonstrate that the method can facilitate real-time analyses of positional information such as identification of bacterial genes encoded in plasmids and pathogenicity islands.


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Chong Chu ◽  
Rebeca Borges-Monroy ◽  
Vinayak V. Viswanadham ◽  
Soohyun Lee ◽  
Heng Li ◽  
...  

Abstract Transposable elements (TEs) help shape the structure and function of the human genome. When inserted into some locations, TEs may disrupt gene regulation and cause diseases. Here, we present xTea (x-Transposable element analyzer), a tool for identifying TE insertions in whole-genome sequencing data. Whereas existing methods are mostly designed for short-read data, xTea can be applied to both short-read and long-read data. Our analysis shows that xTea outperforms other short read-based methods for both germline and somatic TE insertion discovery. With long-read data, we created a catalogue of polymorphic insertions with full assembly and annotation of insertional sequences for various types of retroelements, including pseudogenes and endogenous retroviruses. Notably, we find that individual genomes have an average of nine groups of full-length L1s in centromeres, suggesting that centromeres and other highly repetitive regions such as telomeres are a significant yet unexplored source of active L1s. xTea is available at https://github.com/parklab/xTea.

