De Novo Genome Assembly of Chinese Plateau Honeybee Unravels Intraspecies Genetic Diversity in the Eastern Honeybee, Apis cerana

Apis cerana abansis, widely distributed along the southeastern margin of the Qinghai-Tibet Plateau, is considered an excellent model for studying the phenotypic and genetic variation underlying highland adaptation in the Asian honeybee. Herein, we assembled and annotated a chromosome-scale genome of A. cerana abansis using PacBio, Illumina and Hi-C sequencing technologies in order to identify genomic differences between A. cerana abansis and the published genomes of other A. cerana strains. The sequencing methods and the assembly and annotation strategies used for A. cerana abansis were more comprehensive than those of previously published A. cerana genomes. The intraspecific genetic diversity of A. cerana was then characterized at the genomic level. We re-identified the repeat content in the genome of A. cerana abansis, as well as in the other three A. cerana strains. The chemosensory and immune-related proteins in the different A. cerana strains were carefully re-identified, yielding 132 odorant receptor subfamilies, 12 gustatory receptor subfamilies and 22 immune-related pathways. We also found that A. cerana abansis has lost the largest number of chemoreceptors among the published genomes, and we hypothesize that gene loss/gain might help different A. cerana strains adapt to their respective environments. Our work provides more complete and precise assembly and annotation results for the A. cerana genome, and thus a resource for subsequent in-depth studies.

De Novo Sequencing and Hybrid Assembly of the Biofuel Crop Jatropha curcas L.: Identification of Quantitative Trait Loci for Geminivirus Resistance

Genes ◽

10.3390/genes10010069 ◽

2019 ◽

Vol 10 (1) ◽

pp. 69 ◽

Cited By ~ 9

Author(s):

Nagesh Kancharla ◽

Saakshi Jalali ◽

J. Narasimham ◽

Vinod Nair ◽

Vijay Yepuri ◽

...

Keyword(s):

Ssr Markers ◽

Genome Assembly ◽

Jatropha Curcas ◽

Quantitative Trait ◽

De Novo ◽

Mapping Population ◽

Single Copy ◽

Sequencing Data ◽

De Novo Genome Assembly ◽

Sequencing Technologies

Jatropha curcas is an important perennial, drought-tolerant plant that has been identified as a potential biodiesel crop. We report here the hybrid de novo genome assembly of J. curcas, generated using Illumina and PacBio sequencing technologies, and the identification of quantitative trait loci for Jatropha Mosaic Virus (JMV) resistance. We generated scaffolds spanning 265.7 Mbp, which capture 84.8% of the gene space according to Benchmarking Universal Single-Copy Orthologs (BUSCO) analysis. Additionally, 96.4% of predicted protein-coding genes were captured in RNA sequencing data, which supports the accuracy of the assembled genome. The genome was used to identify 12,103 dinucleotide simple sequence repeat (SSR) markers, which were exploited in genetic diversity analysis to identify genetically distinct lines. A total of 207 polymorphic SSR markers were employed to construct a genetic linkage map for JMV resistance, using an interspecific F2 mapping population with susceptible J. curcas and resistant Jatropha integerrima as parents. Quantitative trait locus (QTL) analysis identified three minor QTLs for JMV resistance, which were validated in an alternate F2 mapping population. These validated QTLs were used in marker-assisted breeding for JMV resistance. Comparative genomics of oil-producing genes across selected oil-producing species revealed 27 conserved genes and 2986 orthologous protein clusters in Jatropha. This reference genome assembly gives insight into the complex genetic structure of Jatropha and serves as a source for the development of agronomically improved virus-resistant and oil-producing lines.
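The dinucleotide SSR markers mentioned in this abstract are tandem repeats of a two-base motif. As a rough illustration only (the abstract does not say which SSR-mining tool or thresholds the authors used), the Python sketch below scans a sequence for such repeats with a regular expression. The function name `find_dinucleotide_ssrs` and the `min_repeats` default of 6 are assumptions chosen for the example.

```python
import re


def find_dinucleotide_ssrs(sequence, min_repeats=6):
    """Return (start, motif, repeat_count) for each dinucleotide SSR found.

    min_repeats is a hypothetical threshold, not a value from the paper.
    """
    # Group 1 is the two-base motif. The negative lookahead (?!\2) forces the
    # second base to differ from the first, so homopolymers (AA, CC, ...) are
    # not reported as dinucleotide repeats.
    pattern = re.compile(r"(([ACGT])(?!\2)[ACGT])\1{%d,}" % (min_repeats - 1))
    hits = []
    for match in pattern.finditer(sequence.upper()):
        motif = match.group(1)
        repeat_count = len(match.group(0)) // 2
        hits.append((match.start(), motif, repeat_count))
    return hits


if __name__ == "__main__":
    toy = "TT" + "AC" * 6 + "GG"
    for start, motif, count in find_dinucleotide_ssrs(toy):
        print(start, motif, count)
```

A real marker pipeline would additionally design flanking primers and test each locus for polymorphism between the parental lines, which this sketch does not attempt.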

A Practical Comparison of De Novo Genome Assembly Software Tools for Next-Generation Sequencing Technologies

PLoS ONE ◽

10.1371/journal.pone.0017915 ◽

2011 ◽

Vol 6 (3) ◽

pp. e17915 ◽

Cited By ~ 144

Author(s):

Wenyu Zhang ◽

Jiajia Chen ◽

Yang Yang ◽

Yifei Tang ◽

Jing Shang ◽

...

Keyword(s):

Next Generation Sequencing ◽

Genome Assembly ◽

De Novo ◽

Software Tools ◽

Next Generation ◽

De Novo Genome Assembly ◽

Sequencing Technologies ◽

Generation Sequencing ◽

Assembly Software

Hybrid de novo Genome Assembly of Erwinia sp. E602 and Bioinformatic Analysis Characterized a New Plasmid-Borne lac Operon Under Positive Selection

Frontiers in Microbiology ◽

10.3389/fmicb.2021.783195 ◽

2021 ◽

Vol 12 ◽

Author(s):

Yu Xia ◽

Zhi-Yuan Wei ◽

Rui He ◽

Jia-Huan Li ◽

Zhi-Xin Wang ◽

...

Keyword(s):

Positive Selection ◽

Genome Assembly ◽

De Novo ◽

Bioinformatic Analysis ◽

Lac Operon ◽

Pacbio Sequencing ◽

Metabolic Pathway Analysis ◽

De Novo Genome Assembly ◽

Sequencing Technologies ◽

Lactose Metabolism

Our previous study identified a new β-galactosidase in Erwinia sp. E602. To further understand lactose metabolism in this strain, de novo genome assembly was conducted using a strategy combining Illumina and PacBio sequencing technologies. The whole genome of Erwinia sp. E602 includes a 4.8 Mb chromosome and a large 326 kb plasmid. A total of 4,739 genes were annotated, including 4,543 protein-coding genes, 25 rRNA genes, 82 tRNA genes and 7 other ncRNA genes. The plasmid is the largest characterized in the genus Erwinia so far, and it contains a number of genes and pathways responsible for lactose metabolism and regulation. Moreover, a new plasmid-borne lac operon that lacks a typical β-galactoside transacetylase (lacA) gene was identified in the strain. Phylogenetic analysis showed that the genes lacY and lacZ in the operon were under positive selection, indicating adaptation of lactose metabolism to the environment in Erwinia sp. E602. Our study demonstrates that hybrid de novo genome assembly using Illumina and PacBio sequencing technologies, together with metabolic pathway analysis, provides a useful strategy for better understanding the evolution of undiscovered microbial species or strains.

De novo Nanopore read quality improvement using deep learning

BMC Bioinformatics ◽

10.1186/s12859-019-3103-z ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 4

Author(s):

Nathan LaPierre ◽

Rob Egan ◽

Wei Wang ◽

Zhong Wang

Keyword(s):

Error Correction ◽

Genome Assembly ◽

Large Scale ◽

De Novo ◽

Error Rates ◽

De Novo Genome Assembly ◽

Sequencing Technologies ◽

Oxford Nanopore ◽

Long Read ◽

Read Error Correction

Background: Long read sequencing technologies such as Oxford Nanopore can greatly decrease the complexity of de novo genome assembly and large structural variation identification. Currently Nanopore reads have high error rates, and the errors often cluster into low-quality segments within the reads. The limited sensitivity of existing read-based error correction methods can cause large-scale mis-assemblies in the assembled genomes, motivating further innovation in this area. Results: Here we developed a Convolutional Neural Network (CNN) based method, called MiniScrub, for identification and subsequent “scrubbing” (removal) of low-quality Nanopore read segments to minimize their interference in the downstream assembly process. MiniScrub first generates read-to-read overlaps via MiniMap2, then encodes the overlaps into images, and finally builds CNN models to predict low-quality segments. Applying MiniScrub to real-world control datasets under several different parameters, we show that it robustly improves read quality, and improves read error correction in the metagenome setting. Compared to raw reads, de novo genome assembly with scrubbed reads produces many fewer mis-assemblies and large indel errors. Conclusions: MiniScrub is able to robustly improve read quality of Oxford Nanopore reads, especially in the metagenome setting, making it useful for downstream applications such as de novo assembly. We propose MiniScrub as a tool for preprocessing Nanopore reads for downstream analyses. MiniScrub is open-source software and is available at https://bitbucket.org/berkeleylab/jgi-miniscrub.
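The "scrubbing" step described here can be pictured as cutting a read wherever a segment is predicted to be low quality and keeping the remaining pieces. The Python sketch below illustrates only that final step, under stated assumptions: the per-segment scores are taken as given (in MiniScrub they come from the CNN), and the function name `scrub_read` and the `segment_len`, `cutoff` and `min_len` parameters are hypothetical, not MiniScrub's actual interface.

```python
def scrub_read(read, segment_scores, segment_len=48, cutoff=0.8, min_len=100):
    """Split a read at low-scoring segments and return the surviving pieces.

    segment_scores[i] is an assumed quality prediction in [0, 1] for
    read[i * segment_len : (i + 1) * segment_len]. All defaults are
    illustrative values, not taken from the paper.
    """
    pieces = []
    start = None  # start offset of the current run of good segments
    for i, score in enumerate(segment_scores):
        if score >= cutoff:
            if start is None:
                start = i * segment_len
        elif start is not None:
            # A low-quality segment ends the current run of good segments.
            pieces.append(read[start:i * segment_len])
            start = None
    if start is not None:
        pieces.append(read[start:])
    # Very short leftovers are rarely useful for assembly, so drop them.
    return [piece for piece in pieces if len(piece) >= min_len]


if __name__ == "__main__":
    read = "A" * 10 + "C" * 10 + "G" * 10 + "T" * 10
    scores = [0.9, 0.2, 0.95, 0.9]
    print(scrub_read(read, scores, segment_len=10, cutoff=0.8, min_len=15))
```

In this toy run the second segment is scrubbed, the first piece is too short to keep, and the last two segments survive as one piece.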

Benchmarking metagenomic classification tools for long-read sequencing data

10.1101/2020.11.25.397729 ◽

2020 ◽

Author(s):

Josip Marić ◽

Krešimir Križanović ◽

Sylvain Riondet ◽

Niranjan Nagarajan ◽

Mile Šikić

Keyword(s):

De Novo ◽

Real Life ◽

Metagenomic Analysis ◽

Sequencing Data ◽

De Novo Genome Assembly ◽

Sequencing Technologies ◽

Long Reads ◽

Species Abundances ◽

Long Read ◽

Eukaryotic Genomes

In recent years, both long-read sequencing and metagenomic analysis have advanced significantly. Although long-read sequencing technologies have been used primarily for de novo genome assembly, they are rapidly maturing for widespread use in other applications. In particular, long reads could potentially lead to more precise taxonomic identification, which has sparked an interest in using them for metagenomic analysis. Here we present a benchmark of several state-of-the-art tools for metagenomic taxonomic classification, tested on in-silico datasets constructed using real long reads from isolate sequencing. We compare tools that were either newly developed or modified to work with long reads, including the k-mer based tools Kraken2, Centrifuge and CLARK, and the mapping-based tools MetaMaps and MEGAN-LR. The test datasets were constructed with varying numbers of bacterial and eukaryotic genomes to simulate different real-life metagenomic applications. The tools were tested on their ability to detect species accurately and to estimate species abundances precisely. Our analysis shows that all tested classifiers provide useful results, and that the composition of the database used strongly influences performance. Using the same database, the tested tools achieve comparable results except for MetaMaps, which slightly outperforms the others in most metrics but is significantly slower than the k-mer based tools. We deem there is significant room for improvement for all tested tools, especially in lowering the number of false-positive detections.
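The abstract evaluates classifiers on species detection and on abundance estimation but does not list the exact metrics. As a hedged illustration, the Python sketch below computes precision and recall for species detection plus an L1 distance between true and predicted abundance profiles. These particular metrics, the function name `detection_metrics` and the `min_abundance` reporting threshold are assumptions for the example, not the benchmark's stated protocol.

```python
def detection_metrics(true_abundance, predicted_abundance, min_abundance=0.0):
    """Compare a predicted species profile with the known composition.

    Both arguments map species name -> relative abundance. A species counts
    as "called" when its predicted abundance exceeds min_abundance
    (a hypothetical reporting threshold).
    """
    truth = {s for s, a in true_abundance.items() if a > 0}
    called = {s for s, a in predicted_abundance.items() if a > min_abundance}

    true_pos = len(truth & called)
    false_pos = len(called - truth)  # species reported but not in the sample
    false_neg = len(truth - called)  # species in the sample but missed

    precision = true_pos / (true_pos + false_pos) if called else 0.0
    recall = true_pos / (true_pos + false_neg) if truth else 0.0

    # L1 distance between the two abundance vectors over all species seen.
    all_species = set(true_abundance) | set(predicted_abundance)
    l1_distance = sum(
        abs(true_abundance.get(s, 0.0) - predicted_abundance.get(s, 0.0))
        for s in all_species
    )
    return {
        "precision": precision,
        "recall": recall,
        "false_positives": false_pos,
        "l1_distance": l1_distance,
    }


if __name__ == "__main__":
    truth = {"species_A": 0.5, "species_B": 0.5}
    predicted = {"species_A": 0.6, "species_B": 0.3, "species_C": 0.1}
    print(detection_metrics(truth, predicted))
```

Raising `min_abundance` trades recall for fewer false-positive detections, which is the weakness the abstract highlights.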

Yet another de novo genome assembler

10.1101/656306 ◽

2019 ◽

Cited By ~ 4

Author(s):

Robert Vaser ◽

Mile Šikić

Keyword(s):

De Novo ◽

Sequence Classification ◽

De Novo Genome Assembly ◽

Development Fund ◽

European Regional Development Fund ◽

Sequencing Technologies ◽

Single Genome ◽

Long Read ◽

Metagenome Assembly ◽

Genome Assemblies

Advances in sequencing technologies have pushed the limits of genome assemblies beyond imagination. The sheer amount of long-read data being generated enables the assembly of even the largest and most complex organisms, for which efficient algorithms are needed. We present a new tool, called Ra, for de novo genome assembly of long uncorrected reads. It is a fast and memory-friendly assembler based on sequence classification and assembly graphs, developed with large genomes in mind. It is freely available at https://github.com/lbcb-sci/ra. This work has been supported in part by the Croatian Science Foundation under the project Single genome and metagenome assembly (IP-2018-01-5886), and in part by the European Regional Development Fund under the grant KK.01.1.1.01.0009 (DATACROSS). In addition, M.Š. is partly supported by funding from A*STAR, Singapore.

Human Genome Assembly in 100 Minutes

10.1101/705616 ◽

2019 ◽

Cited By ~ 20

Author(s):

Chen-Shan Chin ◽

Asif Khalak

Keyword(s):

Single Molecule ◽

Dna Sequences ◽

Genome Assembly ◽

De Novo ◽

Critical Factor ◽

Read Length ◽

De Novo Genome Assembly ◽

Small Indels ◽

Sequencing Technologies ◽

Long Read

De novo genome assembly provides comprehensive, unbiased genomic information and makes it possible to gain insight into new DNA sequences not present in reference genomes. Many de novo human genomes have been published in the last few years, leveraging a combination of inexpensive short-read and single-molecule long-read technologies. As long-read DNA sequencers become more prevalent, the computational burden of generating assemblies persists as a critical factor. The most common approach to long-read assembly, using an overlap-layout-consensus (OLC) paradigm, requires all-to-all read comparisons, whose computational complexity scales quadratically with the number of reads. We assert that recent achievements in sequencing technology (i.e., accuracy of ~99% and read lengths of ~10-15k) enable a fundamentally better strategy for OLC that is effectively linear rather than quadratic. Our genome assembly implementation, Peregrine, uses sparse hierarchical minimizers (SHIMMER) to index reads, thereby avoiding the need for an all-to-all read comparison step. Peregrine can assemble 30x human PacBio CCS read datasets in less than 30 CPU hours and around 100 wall-clock minutes into a high-contiguity assembly (N50 > 20Mb). The continued advance of sequencing technologies coupled with the Peregrine assembler enables routine generation of human de novo assemblies. This will allow for population-scale measurements of more comprehensive genomic variations -- beyond SNPs and small indels -- as well as novel applications requiring rapid access to de novo assemblies.
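The idea of replacing all-to-all read comparison with an index lookup can be sketched with ordinary window minimizers. The Python example below is a simplified illustration, not Peregrine's SHIMMER scheme: it uses a single, non-hierarchical level of minimizers, picks the lexicographically smallest k-mer per window, and the names `minimizers` and `candidate_pairs` and the values of `k`, `w` and `min_shared` are assumptions for the example.

```python
from collections import defaultdict
from itertools import combinations


def minimizers(seq, k=5, w=4):
    """Return the set of (kmer, position) minimizers of seq.

    In every window of w consecutive k-mers, the lexicographically smallest
    k-mer is kept (leftmost on ties).
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        offset = min(range(w), key=lambda j: window[j])
        picked.add((window[offset], start + offset))
    return picked


def candidate_pairs(reads, k=5, w=4, min_shared=1):
    """Return read-name pairs that share at least min_shared minimizer k-mers.

    reads maps read name -> sequence. Only reads that land in the same index
    bucket are ever paired, so no all-to-all comparison is performed.
    """
    index = defaultdict(set)  # minimizer k-mer -> names of reads containing it
    for name, seq in reads.items():
        for kmer, _position in minimizers(seq, k, w):
            index[kmer].add(name)

    shared = defaultdict(int)
    for names in index.values():
        for pair in combinations(sorted(names), 2):
            shared[pair] += 1
    return {pair for pair, count in shared.items() if count >= min_shared}


if __name__ == "__main__":
    reads = {
        "r1": "ACGTTGCATGCCGATAGGCTA",
        "r2": "GCCGATAGGCTATTCGGAAC",  # overlaps the end of r1
        "r3": "TTTTTTTTTTTTTTTTTTTT",
    }
    print(candidate_pairs(reads))
```

Two reads that share any exact substring of at least `w + k - 1` bases are guaranteed to share a minimizer, which is why the index finds overlapping reads. Buckets for highly repetitive k-mers can still grow large, so practical tools cap or filter them.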

Scalable multi whole-genome alignment using recursive exact matching

10.1101/022715 ◽

2015 ◽

Cited By ~ 9

Author(s):

Jasper Linthorst ◽

Marc Hulsman ◽

Henne Holstege ◽

Marcel Reinders

Keyword(s):

De Novo ◽

Genome Alignment ◽

Exact Matching ◽

Structural Variations ◽

De Novo Genome Assembly ◽

Sequencing Technologies ◽

Third Generation Sequencing ◽

Computational Performance ◽

Human Genomes ◽

Novel Concept

The emergence of third-generation sequencing technologies has brought near-perfect de novo genome assembly within reach. This clears the way towards reference-free detection of genomic variations. In this paper, we introduce a novel concept for aligning whole genomes which allows the alignment of multiple genomes. Alignments are constructed in a recursive manner, in which alignment decisions are statistically supported. Computational performance is achieved by splitting an initial indexing data structure into a multitude of smaller indices. We show that our method can be used to detect high-resolution structural variations between two human genomes, and that it can be used to obtain a high-quality multiple genome alignment of at least nineteen Mycobacterium tuberculosis genomes. An implementation of the outlined algorithm, called REVEAL, is available at https://github.com/jasperlinthorst/REVEAL
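The recursive construction described here can be illustrated on two short sequences: take the longest exact match as an anchor, then repeat the search independently on the regions to its left and right. The Python sketch below is a toy version under stated assumptions. It uses the standard library's `difflib.SequenceMatcher` instead of REVEAL's index structures, replaces the statistical support test with a simple `min_len` cutoff, handles only two sequences, and the function name `recursive_exact_anchors` is hypothetical.

```python
from difflib import SequenceMatcher


def recursive_exact_anchors(a, b, min_len=4):
    """Return collinear exact-match anchors as (pos_in_a, pos_in_b, length).

    The longest exact match in the current sub-region becomes an anchor; the
    search then recurses on the flanks. min_len is an assumed stand-in for a
    proper significance test on each match.
    """
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    anchors = []

    def recurse(a_lo, a_hi, b_lo, b_hi):
        match = matcher.find_longest_match(a_lo, a_hi, b_lo, b_hi)
        if match.size < min_len:
            return  # nothing trustworthy left in this sub-region
        recurse(a_lo, match.a, b_lo, match.b)  # left flank
        anchors.append((match.a, match.b, match.size))
        recurse(match.a + match.size, a_hi, match.b + match.size, b_hi)  # right flank

    recurse(0, len(a), 0, len(b))
    return anchors


if __name__ == "__main__":
    seq_a = "ACGTACGTTTGACCA"
    seq_b = "ACGTACGTGGGACCA"
    for anchor in recursive_exact_anchors(seq_a, seq_b):
        print(anchor)
```

The gaps between consecutive anchors are the regions where the two sequences differ, which is where variation would be reported. Recursion depth grows with the number of anchors, so an iterative stack would be safer on genome-scale input.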

Methods for De-novo Genome Assembly

10.20944/preprints202006.0324.v1 ◽

2020 ◽

Author(s):

Arash Bayat ◽

Hasindu Gamaarachchi ◽

Nandan P. Deshpande ◽

Marc R. Wilkins ◽

Sri Parameswaran

Keyword(s):

Genome Assembly ◽

De Novo ◽

Detailed Comparison ◽

Hybrid Assembly ◽

De Novo Genome Assembly ◽

Sequencing Technologies ◽

Oxford Nanopore ◽

Comparative Review ◽

Long Read ◽

Synthetic Datasets

Despite advances in algorithms and computational platforms, de-novo genome assembly remains a challenging process. Due to the constant innovation in sequencing technologies (Sanger, SOLiD, Illumina, 454, PacBio and Oxford Nanopore), genome assembly has evolved to respond to the changes in input data type. This paper includes a broad and comparative review of the most recent short-read, long-read and hybrid assembly techniques. In this review, we provide (1) an algorithmic description of the important processes in the workflow that introduces fundamental concepts and improvements; (2) a review of existing software that explains possible options for genome assembly; and (3) a comparison of the accuracy and the performance of existing methods executed on the same computer using the same processing capabilities and using the same set of real and synthetic datasets. Such evaluation allows a fair and precise comparison of accuracy in all aspects. As a result, this paper identifies both the strengths and weaknesses of each method. This comparative review is unique in providing a detailed comparison of a broad spectrum of cutting-edge algorithms and methods.
