MosaicFlye: Resolving Long Mosaic Repeats Using Long Reads

Author(s):  
Anton Bankevich ◽  
Pavel Pevzner
Keyword(s):  
2020 ◽  
Vol 15 ◽  
Author(s):  
Hongdong Li ◽  
Wenjing Zhang ◽  
Yuwen Luo ◽  
Jianxin Wang

Aims: Accurately detect isoforms from third generation sequencing data. Background: Transcriptome annotation is the basis for the analysis of gene expression and regulation. The transcriptome annotation of many organisms such as humans is far from incomplete, due partly to the challenge in the identification of isoforms that are produced from the same gene through alternative splicing. Third generation sequencing (TGS) reads provide unprecedented opportunity for detecting isoforms due to their long length that exceeds the length of most isoforms. One limitation of current TGS reads-based isoform detection methods is that they are exclusively based on sequence reads, without incorporating the sequence information of known isoforms. Objective: Develop an efficient method for isoform detection. Method: Based on annotated isoforms, we propose a splice isoform detection method called IsoDetect. First, the sequence at exon-exon junction is extracted from annotated isoforms as the “short feature sequence”, which is used to distinguish different splice isoforms. Second, we aligned these feature sequences to long reads and divided long reads into groups that contain the same set of feature sequences, thereby avoiding the pair-wise comparison among the large number of long reads. Third, clustering and consensus generation are carried out based on sequence similarity. For the long reads that do not contain any short feature sequence, clustering analysis based on sequence similarity is performed to identify isoforms. Result: Tested on two datasets from Calypte Anna and Zebra Finch, IsoDetect showed higher speed and compelling accuracy compared with four existing methods. Conclusion: IsoDetect is a promising method for isoform detection. Other: This paper was accepted by the CBC2019 conference.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Omar Abou Saada ◽  
Andreas Tsouris ◽  
Chris Eberlein ◽  
Anne Friedrich ◽  
Joseph Schacherer

AbstractWhile genome sequencing and assembly are now routine, we do not have a full, precise picture of polyploid genomes. No existing polyploid phasing method provides accurate and contiguous haplotype predictions. We developed nPhase, a ploidy agnostic tool that leverages long reads and accurate short reads to solve alignment-based phasing for samples of unspecified ploidy (https://github.com/OmarOakheart/nPhase). nPhase is validated by tests on simulated and real polyploids. nPhase obtains on average over 95% accuracy and a contiguous 1.25 haplotigs per haplotype to cover more than 90% of each chromosome (heterozygosity rate ≥ 0.5%). nPhase allows population genomics and hybrid studies of polyploids.


Microbiome ◽  
2021 ◽  
Vol 9 (1) ◽  
Author(s):  
Benjamin J. Callahan ◽  
Dmitry Grinevich ◽  
Siddhartha Thakur ◽  
Michael A. Balamotis ◽  
Tuval Ben Yehezkel

Abstract Background Out of the many pathogenic bacterial species that are known, only a fraction are readily identifiable directly from a complex microbial community using standard next generation DNA sequencing. Long-read sequencing offers the potential to identify a wider range of species and to differentiate between strains within a species, but attaining sufficient accuracy in complex metagenomes remains a challenge. Methods Here, we describe and analytically validate LoopSeq, a commercially available synthetic long-read (SLR) sequencing technology that generates highly accurate long reads from standard short reads. Results LoopSeq reads are sufficiently long and accurate to identify microbial genes and species directly from complex samples. LoopSeq perfectly recovered the full diversity of 16S rRNA genes from known strains in a synthetic microbial community. Full-length LoopSeq reads had a per-base error rate of 0.005%, which exceeds the accuracy reported for other long-read sequencing technologies. 18S-ITS and genomic sequencing of fungal and bacterial isolates confirmed that LoopSeq sequencing maintains that accuracy for reads up to 6 kb in length. LoopSeq full-length 16S rRNA reads could accurately classify organisms down to the species level in rinsate from retail meat samples, and could differentiate strains within species identified by the CDC as potential foodborne pathogens. Conclusions The order-of-magnitude improvement in length and accuracy over standard Illumina amplicon sequencing achieved with LoopSeq enables accurate species-level and strain identification from complex- to low-biomass microbiome samples. The ability to generate accurate and long microbiome sequencing reads using standard short read sequencers will accelerate the building of quality microbial sequence databases and removes a significant hurdle on the path to precision microbial genomics.


BMC Genomics ◽  
2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Seth Commichaux ◽  
Kiran Javkar ◽  
Padmini Ramachandran ◽  
Niranjan Nagarajan ◽  
Denis Bertrand ◽  
...  

Abstract Background Whole genome sequencing of cultured pathogens is the state of the art public health response for the bioinformatic source tracking of illness outbreaks. Quasimetagenomics can substantially reduce the amount of culturing needed before a high quality genome can be recovered. Highly accurate short read data is analyzed for single nucleotide polymorphisms and multi-locus sequence types to differentiate strains but cannot span many genomic repeats, resulting in highly fragmented assemblies. Long reads can span repeats, resulting in much more contiguous assemblies, but have lower accuracy than short reads. Results We evaluated the accuracy of Listeria monocytogenes assemblies from enrichments (quasimetagenomes) of naturally-contaminated ice cream using long read (Oxford Nanopore) and short read (Illumina) sequencing data. Accuracy of ten assembly approaches, over a range of sequencing depths, was evaluated by comparing sequence similarity of genes in assemblies to a complete reference genome. Long read assemblies reconstructed a circularized genome as well as a 71 kbp plasmid after 24 h of enrichment; however, high error rates prevented high fidelity gene assembly, even at 150X depth of coverage. Short read assemblies accurately reconstructed the core genes after 28 h of enrichment but produced highly fragmented genomes. Hybrid approaches demonstrated promising results but had biases based upon the initial assembly strategy. Short read assemblies scaffolded with long reads accurately assembled the core genes after just 24 h of enrichment, but were highly fragmented. Long read assemblies polished with short reads reconstructed a circularized genome and plasmid and assembled all the genes after 24 h enrichment but with less fidelity for the core genes than the short read assemblies. Conclusion The integration of long and short read sequencing of quasimetagenomes expedited the reconstruction of a high quality pathogen genome compared to either platform alone. A new and more complete level of information about genome structure, gene order and mobile elements can be added to the public health response by incorporating long read analyses with the standard short read WGS outbreak response.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Maria Artesi ◽  
Vincent Hahaut ◽  
Basiel Cole ◽  
Laurens Lambrechts ◽  
Fereshteh Ashrafi ◽  
...  

AbstractThe integration of a viral genome into the host genome has a major impact on the trajectory of the infected cell. Integration location and variation within the associated viral genome can influence both clonal expansion and persistence of infected cells. Methods based on short-read sequencing can identify viral insertion sites, but the sequence of the viral genomes within remains unobserved. We develop PCIP-seq, a method that leverages long reads to identify insertion sites and sequence their associated viral genome. We apply the technique to exogenous retroviruses HTLV-1, BLV, and HIV-1, endogenous retroviruses, and human papillomavirus.


Author(s):  
Guangtu Gao ◽  
Susana Magadan ◽  
Geoffrey C Waldbieser ◽  
Ramey C Youngblood ◽  
Paul A Wheeler ◽  
...  

Abstract Currently, there is still a need to improve the contiguity of the rainbow trout reference genome and to use multiple genetic backgrounds that will represent the genetic diversity of this species. The Arlee doubled haploid line was originated from a domesticated hatchery strain that was originally collected from the northern California coast. The Canu pipeline was used to generate the Arlee line genome de-novo assembly from high coverage PacBio long-reads sequence data. The assembly was further improved with Bionano optical maps and Hi-C proximity ligation sequence data to generate 32 major scaffolds corresponding to the karyotype of the Arlee line (2 N = 64). It is composed of 938 scaffolds with N50 of 39.16 Mb and a total length of 2.33 Gb, of which ∼95% was in 32 chromosome sequences with only 438 gaps between contigs and scaffolds. In rainbow trout the haploid chromosome number can vary from 29 to 32. In the Arlee karyotype the haploid chromosome number is 32 because chromosomes Omy04, 14 and 25 are divided into six acrocentric chromosomes. Additional structural variations that were identified in the Arlee genome included the major inversions on chromosomes Omy05 and Omy20 and additional 15 smaller inversions that will require further validation. This is also the first rainbow trout genome assembly that includes a scaffold with the sex-determination gene (sdY) in the chromosome Y sequence. The utility of this genome assembly is demonstrated through the improved annotation of the duplicated genome loci that harbor the IGH genes on chromosomes Omy12 and Omy13.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Lidong Guo ◽  
Mengyang Xu ◽  
Wenchao Wang ◽  
Shengqiang Gu ◽  
Xia Zhao ◽  
...  

Abstract Background Synthetic long reads (SLR) with long-range co-barcoding information are now widely applied in genomics research. Although several tools have been developed for each specific SLR technique, a robust standalone scaffolder with high efficiency is warranted for hybrid genome assembly. Results In this work, we developed a standalone scaffolding tool, SLR-superscaffolder, to link together contigs in draft assemblies using co-barcoding and paired-end read information. Our top-to-bottom scheme first builds a global scaffold graph based on Jaccard Similarity to determine the order and orientation of contigs, and then locally improves the scaffolds with the aid of paired-end information. We also exploited a screening algorithm to reduce the negative effect of misassembled contigs in the input assembly. We applied SLR-superscaffolder to a human single tube long fragment read sequencing dataset and increased the scaffold NG50 of its corresponding draft assembly 1349 fold. Moreover, benchmarking on different input contigs showed that this approach overall outperformed existing SLR scaffolders, providing longer contiguity and fewer misassemblies, especially for short contigs assembled by next-generation sequencing data. The open-source code of SLR-superscaffolder is available at https://github.com/BGI-Qingdao/SLR-superscaffolder. Conclusions SLR-superscaffolder can dramatically improve the contiguity of a draft assembly by integrating a hybrid assembly strategy.


Sign in / Sign up

Export Citation Format

Share Document