Strainline: full-length de novo viral haplotype reconstruction from noisy long reads

2021 ◽  
Author(s):  
Xiao Luo ◽  
Xiongbin Kang ◽  
Alexander Schoenhuth

Haplotype-resolved de novo assembly of highly diverse virus genomes is critical in the prevention, control and treatment of viral diseases. Current methods can either handle only relatively accurate short-read data, or they collapse haplotype-specific variation into a consensus sequence. Here, we present Strainline, a novel approach to assemble viral haplotypes from noisy long reads without a reference genome. As a crucial consequence, Strainline is the first approach to provide strain-resolved, full-length de novo assemblies of viral quasispecies from noisy third-generation sequencing data. Benchmarking experiments on both simulated and real datasets of varying complexity and diversity confirm this, demonstrating that Strainline outperforms the state of the art on the relevant criteria.

2020 ◽  
Vol 15 ◽  
Author(s):  
Hongdong Li ◽  
Wenjing Zhang ◽  
Yuwen Luo ◽  
Jianxin Wang

Aims: Accurately detect isoforms from third-generation sequencing data. Background: Transcriptome annotation is the basis for the analysis of gene expression and regulation. The transcriptome annotation of many organisms, including humans, is far from complete, due partly to the challenge of identifying isoforms that are produced from the same gene through alternative splicing. Third-generation sequencing (TGS) reads provide an unprecedented opportunity for detecting isoforms because their length exceeds that of most isoforms. One limitation of current TGS-based isoform detection methods is that they rely exclusively on sequence reads, without incorporating the sequence information of known isoforms. Objective: Develop an efficient method for isoform detection. Method: Based on annotated isoforms, we propose a splice isoform detection method called IsoDetect. First, the sequence at each exon-exon junction is extracted from annotated isoforms as a "short feature sequence", which is used to distinguish different splice isoforms. Second, we align these feature sequences to long reads and divide the long reads into groups that contain the same set of feature sequences, thereby avoiding pair-wise comparison among the large number of long reads. Third, clustering and consensus generation are carried out based on sequence similarity. For the long reads that do not contain any short feature sequence, clustering analysis based on sequence similarity is performed to identify isoforms. Result: Tested on two datasets from Calypte anna and zebra finch, IsoDetect showed higher speed and compelling accuracy compared with four existing methods. Conclusion: IsoDetect is a promising method for isoform detection. Other: This paper was accepted by the CBC2019 conference.
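The grouping step described in the Method section can be illustrated with a minimal sketch. This is not the IsoDetect implementation: the function names are hypothetical, and exact substring matching stands in for the alignment of feature sequences to noisy long reads.

```python
# Sketch of junction-feature grouping (hypothetical, not IsoDetect's code):
# take k bases on each side of every exon-exon junction of annotated isoforms
# as a "short feature sequence", then bin long reads by the set of features
# they contain, avoiding all-vs-all read comparison.

def extract_features(isoforms, k=8):
    """isoforms: list of isoforms, each a list of exon sequences (5'->3')."""
    features = set()
    for exons in isoforms:
        for left, right in zip(exons, exons[1:]):
            features.add(left[-k:] + right[:k])
    return features

def group_reads(reads, features):
    """Key each read by the frozenset of junction features it contains."""
    groups = {}
    for read in reads:
        key = frozenset(f for f in features if f in read)
        groups.setdefault(key, []).append(read)
    return groups
```

Reads sharing the same feature set land in the same group and can then be clustered and collapsed to a consensus, as the abstract describes.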


2020 ◽  
Vol 15 (1) ◽  
pp. 2-16
Author(s):  
Yuwen Luo ◽  
Xingyu Liao ◽  
Fang-Xiang Wu ◽  
Jianxin Wang

Transcriptome assembly plays a critical role in studying biological properties and examining the expression levels of genomes in specific cells. It is also the basis of many downstream analyses. As sequencing speed increases and cost decreases, massive amounts of sequencing data continue to accumulate, and a large number of assembly strategies based on different computational methods and experiments have been developed. How to perform transcriptome assembly efficiently, with high sensitivity and accuracy, has become a key issue. In this work, the issues with transcriptome assembly are explored based on different sequencing technologies. Specifically, transcriptome assemblies with next-generation sequencing reads are divided into reference-based assemblies and de novo assemblies. Examples from different species are used to illustrate that long reads produced by third-generation sequencing technologies can cover full-length transcripts without assembly. In addition, different transcriptome assemblies using Hybrid-seq methods and other tools are also summarized. Finally, we discuss the future directions of transcriptome assemblies.


Author(s):  
Mengyang Xu ◽  
Lidong Guo ◽  
Xiao Du ◽  
Lei Li ◽  
Brock A Peters ◽  
...  

Abstract Motivation: Achieving a near-complete understanding of how the genome of an individual affects that individual's phenotypes requires deciphering the order of variations along homologous chromosomes in species with diploid genomes. However, true diploid assembly of long-range haplotypes remains challenging. Results: To address this, we have developed Haplotype-resolved Assembly for Synthetic long reads using a Trio-binning strategy, or HAST, which uses parental information to classify reads as maternal or paternal. Once sorted, these reads are used to independently de novo assemble the parent-specific haplotypes. We applied HAST to co-barcoded second-generation sequencing data from an Asian individual, resulting in a haplotype assembly covering 94.7% of the reference genome with a scaffold N50 longer than 11 Mb. The high haplotyping precision (∼99.7%) and recall (∼95.9%) represent a substantial improvement over the commonly used tool for assembling co-barcoded reads (Supernova), and are comparable to a trio-binning-based third-generation long-read assembly method (TrioCanu), but with a significantly higher single-base accuracy (up to 99.99997%, Q65). This makes HAST a superior tool for accurate haplotyping and future haplotype-based studies. Availability: The code of the analysis is available at https://github.com/BGI-Qingdao/HAST. Supplementary information: Supplementary data are available at Bioinformatics online.
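The core trio-binning idea, classifying a read by which parent's genome its k-mers point to, can be sketched in a few lines. This is a hypothetical illustration, not the HAST code; real tools count k-mers from whole parental read sets and use much larger k.

```python
# Minimal trio-binning sketch (assumption: parent-unique k-mers are
# informative): a read is assigned to the parent whose unique k-mers it
# shares more of; ties and uninformative reads stay unassigned.

def kmers(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def classify(read, maternal_kmers, paternal_kmers, k=4):
    mat_only = maternal_kmers - paternal_kmers   # k-mers seen only in mother
    pat_only = paternal_kmers - maternal_kmers   # k-mers seen only in father
    rk = kmers(read, k)
    m, p = len(rk & mat_only), len(rk & pat_only)
    if m > p:
        return "maternal"
    if p > m:
        return "paternal"
    return "unassigned"
```

Once reads are binned this way, each bin is assembled independently to yield the parent-specific haplotypes.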


2021 ◽  
Author(s):  
Marek Kokot ◽  
Adam Gudys ◽  
Heng Li ◽  
Sebastian Deorowicz

The cost of maintaining the exabytes of data produced by sequencing experiments every year has become a major issue in today's genomics. In spite of the increasing popularity of third-generation sequencing, the existing algorithms for compressing long reads exhibit only a minor advantage over general-purpose gzip. We present CoLoRd, an algorithm able to reduce third-generation sequencing data by an order of magnitude without affecting the accuracy of downstream analyses.


2019 ◽  
Author(s):  
Camille Marchet ◽  
Pierre Morisse ◽  
Lolita Lecompte ◽  
Arnaud Lefebvre ◽  
Thierry Lecroq ◽  
...  

Abstract Motivation: In the last few years, the error rates of third-generation sequencing data have remained above 5%, including many insertions and deletions. An increasing number of long-read correction methods, whether hybrid or self-correcting, have therefore been proposed to reduce the noise in these sequences. As the quality of the error correction has a huge impact on downstream processes, methods that evaluate error correction tools with precise and reliable statistics are a crucial need. Since error correction is often a resource bottleneck in long-read pipelines, a key requirement for assessment methods is efficiency, so that different tools can be compared quickly. Results: We propose ELECTOR, a reliable and efficient tool to evaluate long-read correction that supports both hybrid and self-correction methods. Our tool provides a complete and relevant set of metrics to assess the read-quality improvement after correction, and it scales to large datasets. ELECTOR is directly compatible with a wide range of state-of-the-art error correction tools, using either simulated or real long reads. We show that ELECTOR reports a wider range of metrics than the state-of-the-art tool, LRCstats, and substantially decreases the runtime needed for assessment on all the studied datasets. Availability: ELECTOR is available at https://github.com/kamimrcht/ELECTOR. Contact: [email protected] or [email protected]
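A toy version of the kind of metric such evaluators compute is the per-base error rate of a read against its true reference region, measured before and after correction. This is a simplified illustration, not ELECTOR's segmentation-based algorithm; the function names are hypothetical.

```python
# Toy correction-quality metric: edit distance between a read and the truth
# sequence it derives from, normalized by the truth length. A correction
# tool should lower this value.

def edit_distance(a, b):
    """Plain Levenshtein DP over two strings, O(len(a)*len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # (mis)match
        prev = cur
    return prev[-1]

def error_rate(read, truth):
    return edit_distance(read, truth) / len(truth)
```

Comparing error_rate(raw_read, truth) with error_rate(corrected_read, truth) over a dataset gives a crude before/after picture; a production evaluator must also handle trimmed and split reads, which is part of what makes this problem nontrivial at scale.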


2017 ◽  
Author(s):  
Juhana I Kammonen ◽  
Olli-Pekka Smolander ◽  
Lars Paulin ◽  
Pedro AB Pereira ◽  
Pia Laine ◽  
...  

Unknown sequences, or gaps, are present in most published genomes across public databases. Gap filling is an important finishing step in de novo genome assembly, especially in large genomes. The gap filling problem is nontrivial, and while many computational tools exist that partially solve it, several have shortcomings in the reliability and correctness of their output, i.e. the gap-filled draft genome. SSPACE-LongRead is a scaffolding software that uses long reads from multiple third-generation sequencing platforms to find links between contigs and combine them. The long reads potentially contain the sequence information needed to fill the gaps, but SSPACE-LongRead currently lacks this functionality. We present an automated pipeline called gapFinisher that processes SSPACE-LongRead output to fill gaps after the actual scaffolding. gapFinisher is based on controlled use of the gap filling tool FGAP and works on all standard Linux/UNIX command lines. We conclude that running the SSPACE-LongRead and gapFinisher workflows enables users to fill gaps reliably, with no need for further scrutiny of the existing sequencing data after the analysis.
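The first thing any gap filler must do is locate the gaps themselves, which appear as runs of 'N' in a scaffold. A small sketch (a hypothetical helper, not part of gapFinisher or FGAP):

```python
# Locate gaps in a scaffold: maximal runs of 'N', reported as half-open
# (start, end) coordinates, optionally filtered by a minimum length.
import re

def find_gaps(scaffold, min_len=1):
    return [(m.start(), m.end())
            for m in re.finditer("N+", scaffold)
            if m.end() - m.start() >= min_len]
```

Each reported interval is then a target for patching with the sequence of long reads that span it.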


Author(s):  
Shien Huang ◽  
Xinyu He ◽  
Guohua Wang ◽  
Ergude Bao

Abstract Contigs assembled from third-generation sequencing long reads are usually more complete than those assembled from second-generation short reads. However, current algorithms still have difficulty assembling long reads into the ideal complete and accurate genome, or the theoretical best result [1]. To improve long-read contigs, and with more and more fully sequenced genomes available, it is still possible to use the similar genome-assisted reassembly method [2], which was initially proposed for short reads and makes use of a genome closely related to the sequencing genome (the similar genome and the target genome, respectively). The method aligns the contigs and reads to the similar genome, and then extends and refines the aligned contigs with the aligned reads. Here, we introduce AlignGraph2, a similar genome-assisted reassembly pipeline for PacBio long reads. AlignGraph2 is a completely redesigned second version of our AlignGraph algorithm; it accepts either error-prone or HiFi long reads and contains four novel algorithms: a similarity-aware alignment algorithm and an alignment filtration algorithm for aligning the long reads and preassembled contigs to the similar genome, and a reassembly algorithm and a weight-adjusted consensus algorithm for extending and refining the preassembled contigs. In our performance tests on both error-prone and HiFi long reads, AlignGraph2 aligns 5.7–27.2% more long reads and 7.3–56.0% more bases than some current alignment algorithms, and is more efficient than or comparable to the others. For contigs assembled with various de novo algorithms and aligned to similar genomes (aligned contigs), AlignGraph2 can extend 8.7–94.7% of them (extendable contigs), and obtains contigs with 7.0–249.6% larger N50 values and 5.2–87.7% fewer indels per 100 kbp (extended contigs). With genomes of decreasing similarity, AlignGraph2 also has relatively stable performance.
The AlignGraph2 software can be downloaded for free from this site: https://github.com/huangs001/AlignGraph2.
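The extension idea can be reduced to a toy example. This is emphatically not AlignGraph2's reassembly algorithm, which anchors contigs and reads via alignments to the similar genome; here an exact suffix-prefix overlap stands in for that anchoring, with a hypothetical function name.

```python
# Toy contig extension: append a read to a contig's right end when the
# read's prefix exactly matches the contig's suffix by at least min_overlap
# bases (a stand-in for alignment-anchored extension).

def extend_right(contig, read, min_overlap=4):
    for olen in range(min(len(contig), len(read)), min_overlap - 1, -1):
        if contig.endswith(read[:olen]):
            return contig + read[olen:]
    return contig  # no sufficient overlap: leave the contig unchanged
```

In the real pipeline, extension is followed by a consensus step over all supporting reads, which is what keeps error-prone long reads from corrupting the extended sequence.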


2018 ◽  
Author(s):  
Jana Ebler ◽  
Marina Haukness ◽  
Trevor Pesout ◽  
Tobias Marschall ◽  
Benedict Paten

Motivation: Current genotyping approaches for single nucleotide variations (SNVs) rely on short, relatively accurate reads from second-generation sequencing devices. Presently, third-generation sequencing platforms able to generate much longer reads are becoming more widespread. These platforms come with the significant drawback of higher sequencing error rates, which makes them ill-suited to current genotyping algorithms. However, the longer reads make more of the genome unambiguously mappable and typically provide linkage information between neighboring variants. Results: In this paper we introduce a novel approach for haplotype-aware genotyping from noisy long reads. We do this by considering bipartitions of the sequencing reads, corresponding to the two haplotypes. We formalize the computational problem in terms of a Hidden Markov Model and compute posterior genotype probabilities using the forward-backward algorithm. Genotype predictions can then be made by picking the most likely genotype at each site. Our experiments indicate that longer reads allow significantly more of the genome to be genotyped accurately. Further, we are able to use both Oxford Nanopore and Pacific Biosciences sequencing data to independently validate millions of variants previously identified by short-read technologies in the reference NA12878 sample, including hundreds of thousands of variants that were not previously included in the high-confidence reference set.
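The notion of a posterior genotype probability can be illustrated with a deliberately simplified single-site model. The paper's method runs forward-backward over an HMM of read bipartitions; the sketch below drops the phasing and HMM machinery entirely and just applies Bayes' rule at one site, assuming a uniform genotype prior and a per-base error rate eps (all names hypothetical).

```python
# Single-site genotype posteriors (simplified, no phasing): each observed
# base comes from the ref allele with probability (1 - p_alt) and the alt
# allele with probability p_alt, where p_alt is the alt-allele dosage of the
# genotype (0/0 -> 0.0, 0/1 -> 0.5, 1/1 -> 1.0); eps is the base error rate.

def genotype_posteriors(bases, ref="A", alt="T", eps=0.1):
    def p_base(b, allele):
        return 1 - eps if b == allele else eps
    likes = {}
    for name, p_alt in (("0/0", 0.0), ("0/1", 0.5), ("1/1", 1.0)):
        like = 1.0
        for b in bases:
            like *= (1 - p_alt) * p_base(b, ref) + p_alt * p_base(b, alt)
        likes[name] = like
    total = sum(likes.values())          # uniform prior cancels out
    return {g: l / total for g, l in likes.items()}
```

Picking the argmax of the returned dictionary corresponds to the "most likely genotype at each site" rule in the abstract; the full model additionally shares information across sites through the haplotype bipartition.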


Genes ◽  
2019 ◽  
Vol 10 (7) ◽  
pp. 519
Author(s):  
Danze Chen ◽  
Qianqian Zhao ◽  
Leiming Jiang ◽  
Shuaiyuan Liao ◽  
Zhigang Meng ◽  
...  

Recent analyses show that transcriptome sequencing can be utilized as a diagnostic tool for rare Mendelian diseases. Third-generation sequencing produces long reads of thousands of base pairs, greatly expanding isoform discovery and the identification of novel long noncoding RNAs. In this study, we developed TGStools, a bioinformatics suite that facilitates routine tasks in transcriptome analysis such as characterizing full-length transcripts, detecting shifted types of alternative splicing, and identifying long noncoding RNAs (lncRNAs). It also prioritizes transcripts with a visualization framework that automatically integrates rich annotation with known genomic features. TGStools is a Python package freely available on GitHub.

