Systematic Comparison of the Performances of De Novo Genome Assemblers for Oxford Nanopore Technology Reads From Piroplasm

Background Emerging long-read sequencing technology has greatly changed the landscape of whole-genome sequencing, enabling scientists to contribute to decoding the genetic information of non-model species. The sequences generated by PacBio or Oxford Nanopore Technology (ONT) must be assembled de novo before further analyses. Several de novo genome assemblers have been developed to assemble long reads generated by ONT. However, genome assembly is still a challenging task, and the performance of these assemblers has not been completely investigated. Methods and Results We systematically evaluated the performance of nine de novo assemblers for ONT on datasets of different coverage depths. Several metrics were measured to determine the performance of these tools, including N50 length, sequence coverage, runtime, ease of operation, genome accuracy, and genomic completeness at varying depths of coverage. Based on the results of our assessments, the performances of these tools are summarized as follows: 1) Coverage depth has a significant effect on genome quality; 2) The level of contiguity of the assembled genome varies dramatically among different de novo tools; 3) The correctness of an assembled genome is closely related to the completeness of the genome. More than 30× nanopore data can be assembled into a relatively complete genome, the quality of which is highly dependent on polishing with next-generation sequencing data. Conclusion Considering the results of our investigation, the advantages and disadvantages of each tool are summarized, and guidelines for selecting assembly tools under specific conditions are provided.
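The N50 length named among the metrics above can be illustrated with a minimal sketch. It uses the standard definition (the length of the contig at which the cumulative sum, taken from longest to shortest, first reaches half of the total assembly length). The contig lengths are hypothetical and are not taken from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths (in bp), for illustration only.
contigs = [100, 80, 50, 40, 30]
print(n50(contigs))
```

A higher N50 indicates a more contiguous assembly, but it says nothing about correctness, which is why the abstract lists accuracy and completeness as separate metrics.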

SLR-superscaffolder: a de novo scaffolding tool for synthetic long reads using a top-to-bottom scheme

BMC Bioinformatics ◽

10.1186/s12859-021-04081-z ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Lidong Guo ◽

Mengyang Xu ◽

Wenchao Wang ◽

Shengqiang Gu ◽

Xia Zhao ◽

...

Keyword(s):

High Efficiency ◽

De Novo ◽

Next Generation Sequencing Data ◽

Sequencing Data ◽

Draft Assembly ◽

Screening Algorithm ◽

Long Reads ◽

Hybrid Genome ◽

Genomics Research ◽

Negative Effect

Abstract Background Synthetic long reads (SLR) with long-range co-barcoding information are now widely applied in genomics research. Although several tools have been developed for each specific SLR technique, a robust standalone scaffolder with high efficiency is warranted for hybrid genome assembly. Results In this work, we developed a standalone scaffolding tool, SLR-superscaffolder, to link together contigs in draft assemblies using co-barcoding and paired-end read information. Our top-to-bottom scheme first builds a global scaffold graph based on Jaccard Similarity to determine the order and orientation of contigs, and then locally improves the scaffolds with the aid of paired-end information. We also exploited a screening algorithm to reduce the negative effect of misassembled contigs in the input assembly. We applied SLR-superscaffolder to a human single tube long fragment read sequencing dataset and increased the scaffold NG50 of its corresponding draft assembly 1349 fold. Moreover, benchmarking on different input contigs showed that this approach overall outperformed existing SLR scaffolders, providing longer contiguity and fewer misassemblies, especially for short contigs assembled by next-generation sequencing data. The open-source code of SLR-superscaffolder is available at https://github.com/BGI-Qingdao/SLR-superscaffolder. Conclusions SLR-superscaffolder can dramatically improve the contiguity of a draft assembly by integrating a hybrid assembly strategy.
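The Jaccard similarity mentioned above can be sketched as follows. This is only an illustration of the measure applied to barcode sets, under the assumption that each contig is summarized by the set of barcodes of reads mapped to it. The contig names, barcode values, and the `best_neighbor` helper are hypothetical and are not SLR-superscaffolder's actual data structures or algorithm.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two barcode collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def best_neighbor(name, barcodes):
    """Return the other contig sharing the most similar barcode set (hypothetical helper)."""
    scores = {
        other: jaccard(barcodes[name], bcs)
        for other, bcs in barcodes.items()
        if other != name
    }
    return max(scores, key=scores.get)


# Hypothetical barcode sets per contig, for illustration only.
barcodes = {
    "contigA": {"bc1", "bc2", "bc3", "bc4"},
    "contigB": {"bc3", "bc4", "bc5"},
    "contigC": {"bc9"},
}
print(jaccard(barcodes["contigA"], barcodes["contigB"]))
print(best_neighbor("contigA", barcodes))
```

Contigs that originate from nearby regions of the same long fragment tend to share barcodes, so a high similarity suggests adjacency in the scaffold.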

De Novo Genome Assembly of Next-Generation Sequencing Data

Compendium of Plant Genomes - The Brassica rapa Genome ◽

10.1007/978-3-662-47901-8_4 ◽

2015 ◽

pp. 41-51

Author(s):

Min Liu ◽

Dongyuan Liu ◽

Hongkun Zheng

Keyword(s):

Next Generation Sequencing ◽

Genome Assembly ◽

De Novo ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

De Novo Genome Assembly ◽

Generation Sequencing

Computational Approaches for Transcriptome Assembly Based on Sequencing Technologies

Current Bioinformatics ◽

10.2174/1574893614666190410155603 ◽

2020 ◽

Vol 15 (1) ◽

pp. 2-16

Author(s):

Yuwen Luo ◽

Xingyu Liao ◽

Fang-Xiang Wu ◽

Jianxin Wang

Keyword(s):

De Novo ◽

Transcriptome Assembly ◽

Critical Role ◽

High Sensitivity ◽

Biological Properties ◽

Sequencing Data ◽

Sequencing Technologies ◽

Long Reads ◽

Massive Sequencing ◽

Generation Sequencing

Transcriptome assembly plays a critical role in studying biological properties and examining the expression levels of genomes in specific cells. It is also the basis of many downstream analyses. With the increase of speed and the decrease in cost, massive sequencing data continues to accumulate. A large number of assembly strategies based on different computational methods and experiments have been developed. How to efficiently perform transcriptome assembly with high sensitivity and accuracy becomes a key issue. In this work, the issues with transcriptome assembly are explored based on different sequencing technologies. Specifically, transcriptome assemblies with next-generation sequencing reads are divided into reference-based assemblies and de novo assemblies. The examples of different species are used to illustrate that long reads produced by the third-generation sequencing technologies can cover full-length transcripts without assemblies. In addition, different transcriptome assemblies using the Hybrid-seq methods and other tools are also summarized. Finally, we discuss the future directions of transcriptome assemblies.

DeNovoCNN: A deep learning approach to de novo variant calling in next generation sequencing data

10.1101/2021.09.20.461072 ◽

2021 ◽

Author(s):

Gelana Khazeeva ◽

Karolis Sablauskas ◽

Bart van der Sanden ◽

Wouter Steyaert ◽

Michael Kwint ◽

...

Keyword(s):

Exome Sequencing ◽

De Novo ◽

Genetic Disorders ◽

Variant Calling ◽

Next Generation Sequencing Data ◽

Sequencing Data ◽

Accurate Identification ◽

Whole Exome ◽

De Novo Variant ◽

Generation Sequencing

De novo mutations (DNMs) are an important cause of genetic disorders. The accurate identification of DNMs from sequencing data is therefore fundamental to rare disease research and diagnostics. Unfortunately, identifying reliable DNMs remains a major challenge due to sequence errors, uneven coverage, and mapping artifacts. Here, we developed a deep convolutional neural network (CNN) DNM caller, DeNovoCNN, that encodes the alignment of sequence reads for a trio as 160×164 resolution images. DeNovoCNN was trained on DNMs from whole exome sequencing (WES) of 2003 trios, achieving on average 99.2% recall and 93.8% precision. We find that DeNovoCNN has increased recall/sensitivity and precision compared to existing de novo calling approaches (GATK, DeNovoGear, Samtools) based on the Genome in a Bottle reference dataset. Sanger validations of DNMs called in both exome and genome datasets confirm that DeNovoCNN outperforms existing methods. Most importantly, we show that DeNovoCNN is robust against different exome sequencing and analysis approaches, thereby allowing it to be applied to other datasets. DeNovoCNN is freely available and can be run on existing alignment (BAM/CRAM) and variant calling (VCF) files from WES and WGS without a need for variant recalling.

An ensemble strategy that significantly improves de novo assembly of microbial genomes from metagenomic next-generation sequencing data

Nucleic Acids Research ◽

10.1093/nar/gkv002 ◽

2015 ◽

Vol 43 (7) ◽

pp. e46-e46 ◽

Cited By ~ 125

Author(s):

Xutao Deng ◽

Samia N. Naccache ◽

Terry Ng ◽

Scot Federman ◽

Linlin Li ◽

...

Keyword(s):

Next Generation Sequencing ◽

De Novo Assembly ◽

De Novo ◽

Next Generation Sequencing Data ◽

De Bruijn Graph ◽

Next Generation ◽

Sequencing Data ◽

Short Reads ◽

Ensemble Strategy ◽

Generation Sequencing

Abstract Next-generation sequencing (NGS) approaches rapidly produce millions to billions of short reads, which allow pathogen detection and discovery in human clinical, animal and environmental samples. A major limitation of sequence homology-based identification for highly divergent microorganisms is the short length of reads generated by most highly parallel sequencing technologies. Short reads require a high level of sequence similarities to annotated genes to confidently predict gene function or homology. Such recognition of highly divergent homologues can be improved by reference-free (de novo) assembly of short overlapping sequence reads into larger contigs. We describe an ensemble strategy that integrates the sequential use of various de Bruijn graph and overlap-layout-consensus assemblers with a novel partitioned sub-assembly approach. We also proposed new quality metrics that are suitable for evaluating metagenome de novo assembly. We demonstrate that this new ensemble strategy tested using in silico spike-in, clinical and environmental NGS datasets achieved significantly better contigs than current approaches.
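As background for the de Bruijn graph assemblers mentioned above, the sketch below shows the basic graph construction in its textbook form: every k-mer in a read contributes an edge from its (k-1)-base prefix to its (k-1)-base suffix. It is not the ensemble strategy described in the abstract, and the reads and the value of `k` are hypothetical.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build a de Bruijn graph mapping each (k-1)-mer to its successor (k-1)-mers."""
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of width k across the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


# Two hypothetical overlapping short reads.
reads = ["ACGTC", "GTCA"]
graph = de_bruijn(reads, k=3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Overlapping reads share k-mers, so their paths merge in the graph; walking an unbranched path then spells out a contig longer than either read.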

Draft genome assemblies using sequencing reads from Oxford Nanopore Technology and Illumina platforms for four species of North American Fundulus killifish

GigaScience ◽

10.1093/gigascience/giaa067 ◽

2020 ◽

Vol 9 (6) ◽

Cited By ~ 3

Author(s):

Lisa K Johnson ◽

Ruta Sahasrabudhe ◽

James Anthony Gill ◽

Jennifer L Roach ◽

Lutz Froenicke ◽

...

Keyword(s):

North American ◽

De Novo ◽

Draft Genome ◽

Whole Genome Sequencing Data ◽

Sequencing Data ◽

Sequence Coverage ◽

Short Read ◽

Oxford Nanopore ◽

Long Read ◽

Genome Assemblies

Abstract Background Whole-genome sequencing data from wild-caught individuals of closely related North American killifish species (Fundulus xenicus, Fundulus catenatus, Fundulus nottii, and Fundulus olivaceus) were obtained using long-read Oxford Nanopore Technology (ONT) PromethION and short-read Illumina platforms. Findings Draft de novo reference genome assemblies were generated using a combination of long and short sequencing reads. For each species, the PromethION platform was used to generate 30–45× sequence coverage, and the Illumina platform was used to generate 50–160× sequence coverage. Illumina-only assemblies were fragmented with high numbers of contigs, while ONT-only assemblies were error prone with low BUSCO scores. The highest N50 values, ranging from 0.4 to 2.7 Mb, were from assemblies generated using a combination of short- and long-read data. BUSCO scores were consistently >90% complete using the Eukaryota database. Conclusions High-quality genomes can be obtained from a combination of using short-read Illumina data to polish assemblies generated with long-read ONT data. Draft assemblies and raw sequencing data are available for public use. We encourage use and reuse of these data for assembly benchmarking and other analyses.
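The sequence coverage figures quoted above follow the usual definition of average depth: total sequenced bases divided by genome size. A minimal sketch of that arithmetic, with hypothetical read lengths and a hypothetical genome size (neither is taken from the study):

```python
def mean_coverage(read_lengths, genome_size):
    """Average depth of coverage: total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return sum(read_lengths) / genome_size


# Hypothetical values, for illustration only.
read_lengths = [10_000, 20_000, 30_000]  # bp per read
genome_size = 2_000                      # bp
print(f"{mean_coverage(read_lengths, genome_size):.1f}x")
```

The result is an average; actual depth varies along the genome, so a nominal 30× dataset can still leave some regions thinly covered.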

A new strategy for enhancing imputation quality of rare variants from next-generation sequencing data via combining SNP and exome chip data

BMC Genomics ◽

10.1186/s12864-015-2192-y ◽

2015 ◽

Vol 16 (1) ◽

Cited By ~ 6

Author(s):

Young Jin Kim ◽

Juyoung Lee ◽

Bong-Jo Kim ◽

Taesung Park

Keyword(s):

Next Generation Sequencing ◽

Rare Variants ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Chip Data ◽

New Strategy ◽

Exome Chip ◽

Generation Sequencing

METAMVGL: a multi-view graph-based metagenomic contig binning algorithm by integrating assembly and paired-end graphs

10.1101/2020.10.18.344697 ◽

2020 ◽

Author(s):

Zhenmiao Zhang ◽

Lu Zhang

Keyword(s):

De Novo ◽

State Of The Art ◽

Label Propagation ◽

Next Generation Sequencing Data ◽

Metagenomic Sequencing ◽

Sequencing Data ◽

Complete Genomes ◽

Generation Sequencing ◽

High Chance ◽

Mock Communities

Abstract Motivation Due to the complexity of metagenomic communities, de novo assembly of next-generation sequencing data is commonly unable to produce complete microbial genomes. Metagenomic binning is a crucial task that groups the fragmented contigs into clusters based on their nucleotide compositions and read depths. These features work well on long contigs, but are not stable for short ones. Assembly and paired-end graphs can provide the connectedness between contigs, where linked contigs have a high chance of being derived from the same cluster. Results We developed METAMVGL, a multi-view graph-based metagenomic contig binning algorithm that integrates both assembly and paired-end graphs. It could strikingly rescue short contigs and correct binning errors from dead-end subgraphs. METAMVGL could learn the graphs’ weights automatically and predict the contig labels in a uniform multi-view label propagation framework. In the experiments, we observed that METAMVGL significantly increased the high-confidence edges in the combined graph and linked dead ends to the main graph. It also outperformed many state-of-the-art binning methods, MaxBin2, MetaBAT2, MyCC, CONCOCT, SolidBin and Graphbin, on metagenomic sequencing data from simulation, two mock communities and real Sharon data. Availability and implementation The software is available at https://github.com/ZhangZhenmiao/METAMVGL.
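The general idea of label propagation on a contig graph can be sketched as below. This is a deliberately simplified, single-graph, unweighted majority vote, not METAMVGL's multi-view framework with learned graph weights. The function name, contig names, and bin labels are all hypothetical.

```python
from collections import Counter


def propagate_labels(edges, seeds, max_rounds=100):
    """Spread bin labels from seed contigs to unlabeled neighbors by majority vote."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    labels = dict(seeds)
    for _ in range(max_rounds):
        updates = {}
        for node in sorted(neighbors):
            if node in labels:
                continue  # labels already assigned are never overwritten
            votes = Counter(labels[n] for n in neighbors[node] if n in labels)
            if votes:
                # Most votes wins; ties are broken alphabetically for determinism.
                updates[node] = min(votes, key=lambda lab: (-votes[lab], lab))
        if not updates:
            break
        labels.update(updates)
    return labels


# Hypothetical graph: long contigs already binned, short contigs unlabeled.
edges = [("long1", "short1"), ("short1", "short2"), ("long2", "short3")]
seeds = {"long1": "binA", "long2": "binB"}
print(propagate_labels(edges, seeds))
```

In this toy graph the short contigs inherit the bin of the long contig they are connected to, which mirrors the motivation in the abstract: composition and depth features are unreliable for short contigs, but graph connectivity still carries signal.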

Clustering de Novo by Gene of Long Reads from Transcriptomics Data

10.1101/170035 ◽

2017 ◽

Cited By ~ 3

Author(s):

Camille Marchet ◽

Lolita Lecompte ◽

Corinne Da Silva ◽

Corinne Cruaud ◽

Jean-Marc Aury ◽

...

Keyword(s):

De Novo ◽

Free Access ◽

Sequencing Data ◽

Base Pairs ◽

Long Reads ◽

Oxford Nanopore ◽

Processing Step ◽

Whole Transcriptome Sequencing ◽

Long Read ◽

Transcriptomics Data

Abstract Long-read sequencing currently provides sequences of several thousand base pairs. This makes it possible to obtain complete transcripts, which offers an unprecedented view of the cellular transcriptome. However, the literature is lacking tools to cluster such data de novo, in particular for Oxford Nanopore Technologies reads, because of their inherently high error rate compared to short reads. Our goal is to process reads from whole transcriptome sequencing data accurately and without a reference genome in order to reliably group reads coming from the same gene. This de novo approach is therefore particularly suitable for non-model species, but can also serve as a useful pre-processing step to improve read mapping. Our contribution is both a new algorithm adapted to the clustering of reads by gene and a practical, free-access tool that scales to the complete processing of eukaryotic transcriptomes. We sequenced a mouse RNA sample using the MinION device; this dataset is used to compare our solution to other algorithms used in the context of biological clustering. We demonstrate that it is better suited for transcriptomics long reads. When a reference is available and mapping is thus possible, we show that it stands as an alternative method that predicts complementary clusters.

Draft genome assembly and transcriptome sequencing of the golden algae Hydrurus foetidus (Chrysophyceae)

F1000Research ◽

10.12688/f1000research.16734.1 ◽

2019 ◽

Vol 8 ◽

pp. 401

Author(s):

Jon Bråte ◽

Janina Fuss ◽

Kjetill S. Jakobsen ◽

Dag Klaveness

Keyword(s):

Genome Assembly ◽

Draft Genome ◽

Next Generation Sequencing Data ◽

Sequencing Data ◽

Draft Genome Assembly ◽

Alpine Regions ◽

Long Reads ◽

Branching Patterns ◽

Variable Morphology ◽

Generation Sequencing

Hydrurus foetidus is a freshwater alga belonging to the phylum Heterokonta. It thrives in cold rivers in polar and high alpine regions. It has several morphological traits reminiscent of single-celled eukaryotes, but can also form macroscopic thalli. Despite its ability to produce polyunsaturated fatty acids, its life under cold conditions and its variable morphology, very little is known about its genome and transcriptome. Here, we present an extensive set of next-generation sequencing data, including genomic short reads from Illumina sequencing and long reads from Nanopore sequencing, as well as full-length cDNAs from PacBio IsoSeq sequencing and a small RNA dataset (smaller than 200 bp) sequenced with Illumina. We combined this data with, to our knowledge, the first draft genome assembly of a chrysophyte alga. The assembly consists of 5069 contigs with a total assembly size of 171 Mb and a 77% BUSCO completeness. The new data generated here may contribute to a better understanding of the evolution and ecological roles of chrysophyte algae, as well as help to resolve the branching patterns within the Heterokonta.
