The genome sequence of Sorbus pohuashanensis provides insights into population evolution and leaf sunburn response

Sorbus pohuashanensis is a potential horticultural and medicinal plant, but its genomic and genetic background remains unknown. Here, we de novo sequenced and assembled the S. pohuashanensis (Hance) Hedl. reference genome using PacBio long reads. Based on the new reference genome, we resequenced a core collection of 22 Sorbus spp. samples, which were divided into two groups (G1 and G2) based on phylogenetic and PCA analysis. These phylogenetic clusters closely matched the classification based on leaf shape. Natural hybridization between the G1 and G2 groups was evidenced by a sample (R21) with a highly heterozygous genotype. Nucleotide diversity (π) analysis showed that G1 has a higher diversity than G2, and that G2 originated from G1. During evolution, gene families involved in photosynthesis pathways expanded and gene families involved in energy consumption contracted. Comparative genome analysis showed that S. pohuashanensis has a high level of chromosomal synteny with Malus domestica and Pyrus communis. RNA-seq data suggested that flavonol biosynthesis and heat-shock protein (HSP)-heat-shock factor (HSF) pathways play important roles in protection against sunburn. This research provides new insight into the evolution of Sorbus spp. genomes. In addition, the genomic resources and the identified genetic variations, especially those genes related to stress resistance, will help future efforts to introduce and breed Sorbus spp.
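As an illustration of the nucleotide diversity (π) statistic mentioned in this abstract, the sketch below computes the average number of pairwise differences per site for a set of aligned sequences. It is a minimal textbook formulation, not the pipeline used in the study; the function name `nucleotide_diversity` and the toy sequences are assumptions made for the example.

```python
from itertools import combinations


def nucleotide_diversity(seqs):
    """Average pairwise differences per site among aligned, equal-length sequences.

    Hypothetical helper for illustration only; real analyses work on
    variant calls and handle missing data, which this sketch ignores.
    """
    if len(seqs) < 2:
        raise ValueError("need at least two sequences")
    length = len(seqs[0])
    if length == 0 or any(len(s) != length for s in seqs):
        raise ValueError("sequences must be non-empty and aligned to the same length")
    pairs = list(combinations(seqs, 2))
    # Count mismatching positions for every unordered pair of sequences.
    total_diffs = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    return total_diffs / (len(pairs) * length)


# Toy alignment (made up): three sequences of four sites each.
pi = nucleotide_diversity(["AAAA", "AAAT", "AATT"])
print(pi)
```

A group with more divergent sequences yields a larger value, which is the sense in which one group can be said to have higher diversity than another.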

Identification and Expression Analysis of the Genes Involved in the Raffinose Family Oligosaccharides Pathway of Phaseolus vulgaris and Glycine max

Plants ◽

10.3390/plants10071465 ◽

2021 ◽

Vol 10 (7) ◽

pp. 1465

Author(s):

Ramon de Koning ◽

Raphaël Kiekens ◽

Mary Esther Muyoka Toili ◽

Geert Angenon

Keyword(s):

Common Bean ◽

Seed Development ◽

Expression Analysis ◽

De Novo ◽

Expression Patterns ◽

Gene Families ◽

Rna Seq ◽

Raffinose Family Oligosaccharides ◽

Specific Expression ◽

Raffinose Synthase

Raffinose family oligosaccharides (RFO) play an important role in plants but are also considered to be antinutritional factors. A profound understanding of the galactinol and RFO biosynthetic gene families and the expression patterns of the individual genes is a prerequisite for the sustainable reduction of the RFO content in the seeds, without compromising normal plant development and functioning. In this paper, an overview of the annotation and genetic structure of all galactinol- and RFO biosynthesis genes is given for soybean and common bean. In common bean, three galactinol synthase genes, two raffinose synthase genes and one stachyose synthase gene were identified for the first time. To discover the expression patterns of these genes in different tissues, two expression atlases have been created through re-analysis of publicly available RNA-seq data. De novo expression analysis through an RNA-seq study during seed development of three varieties of common bean gave more insight into the expression patterns of these genes during the seed development. The results of the expression analysis suggest that different classes of galactinol- and RFO synthase genes have tissue-specific expression patterns in soybean and common bean. With the obtained knowledge, important galactinol- and RFO synthase genes that specifically play a key role in the accumulation of RFOs in the seeds are identified. These candidate genes may play a pivotal role in reducing the RFO content in the seeds of important legumes which could improve the nutritional quality of these beans and would solve the discomforts associated with their consumption.

Haplotype-resolved genome of diploid ginger (Zingiber officinale) and its unique gingerol biosynthetic pathway

Horticulture Research ◽

10.1038/s41438-021-00627-7 ◽

2021 ◽

Vol 8 (1) ◽

Author(s):

Hong-Lei Li ◽

Lin Wu ◽

Zhaoming Dong ◽

Yusong Jiang ◽

Sanjie Jiang ◽

...

Keyword(s):

Biosynthetic Pathway ◽

Southwest China ◽

Reference Genome ◽

Zingiber Officinale ◽

Gene Families ◽

Chromosome Conformation ◽

Long Reads ◽

Transcription Factor Networks ◽

Species Specific ◽

Haplotype 1

Abstract Ginger (Zingiber officinale), the type species of Zingiberaceae, is one of the most widespread medicinal plants and spices. Here, we report a high-quality, chromosome-scale reference genome of ginger ‘Zhugen’, a traditionally cultivated ginger in Southwest China used as a fresh vegetable, assembled from PacBio long reads, Illumina short reads, and high-throughput chromosome conformation capture (Hi-C) reads. The ginger genome was phased into two haplotypes, haplotype 1 (1.53 Gb with a contig N50 of 4.68 M) and haplotype 0 (1.51 Gb with a contig N50 of 5.28 M). Homologous ginger chromosomes maintained excellent gene pair collinearity. In 17,226 pairs of allelic genes, 11.9% exhibited differential expression between alleles. Based on the results of ginger genome sequencing, transcriptome analysis, and metabolomic analysis, we proposed a backbone biosynthetic pathway of gingerol analogs, which consists of 12 enzymatic gene families, PAL, C4H, 4CL, CST, C3’H, C3OMT, CCOMT, CSE, PKS, AOR, DHN, and DHT. These analyses also identified the likely transcription factor networks that regulate the synthesis of gingerol analogs. Overall, this study serves as an excellent resource for further research on ginger biology and breeding, lays a foundation for a better understanding of ginger evolution, and presents an intact biosynthetic pathway for species-specific gingerol biosynthesis.

CStone: A de novo transcriptome assembler for short-read data that identifies non-chimeric contigs based on underlying graph structure

PLoS Computational Biology ◽

10.1371/journal.pcbi.1009631 ◽

2021 ◽

Vol 17 (11) ◽

pp. e1009631

Author(s):

Raquel Linheiro ◽

John Archer

Keyword(s):

De Novo ◽

Simulated Data ◽

Real Data ◽

Gene Families ◽

Classification Systems ◽

Whole Body ◽

Cdna Libraries ◽

Sequence Information ◽

Rna Seq ◽

High Quality

With the exponential growth of sequence information stored over the last decade, including that of de novo assembled contigs from RNA-Seq experiments, quantification of chimeric sequences has become essential when assembling read data. In transcriptomics, de novo assembled chimeras can closely resemble underlying transcripts, but patterns such as those seen between co-evolving sites, or mapped read counts, become obscured. We have created a de Bruijn based de novo assembler for RNA-Seq data that utilizes a classification system to describe the complexity of underlying graphs from which contigs are created. Each contig is labelled with one of three levels, indicating whether or not ambiguous paths exist. A by-product of this is information on the range of complexity of the underlying gene families present. As a demonstration of CStone's ability to assemble high-quality contigs, and to label them in this manner, both simulated and real data were used. For simulated data, ten million read pairs were generated from cDNA libraries representing four species, Drosophila melanogaster, Panthera pardus, Rattus norvegicus and Serinus canaria. These were assembled using CStone, Trinity and rnaSPAdes; the latter two being high-quality, well established, de novo assemblers. For real data, two RNA-Seq datasets, each consisting of ≈30 million read pairs, representing two adult D. melanogaster whole-body samples were used. The contigs that CStone produced were comparable in quality to those of Trinity and rnaSPAdes in terms of length, sequence identity of aligned regions and the range of cDNA transcripts represented, whilst providing additional information on chimerism. Here we describe the details of CStone's assembly and classification process, and propose that similar classification systems can be incorporated into other de novo assembly tools. Within a related side study, we explore the effects that chimeras within reference sets have on the identification of differentially expressed genes. CStone is available at: https://sourceforge.net/projects/cstone/.
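To make the idea of "ambiguous paths" in a de Bruijn graph concrete, the following sketch builds a small graph from reads and reports whether any node branches. This is a generic illustration under stated assumptions, not CStone's actual classification scheme (which uses three levels that the abstract does not define); the names `build_de_bruijn` and `has_ambiguous_paths` and the toy reads are hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer is an edge from its prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def has_ambiguous_paths(graph):
    """True if any node has more than one successor or more than one predecessor."""
    in_degree = defaultdict(int)
    for successors in graph.values():
        if len(successors) > 1:
            return True
        for nxt in successors:
            in_degree[nxt] += 1
    return any(count > 1 for count in in_degree.values())


# Toy reads (made up): the second set diverges after "ACG", creating a branch.
print(has_ambiguous_paths(build_de_bruijn(["ACGTA"], k=3)))
print(has_ambiguous_paths(build_de_bruijn(["ACGTA", "ACGGA"], k=3)))
```

A contig traced through a graph with no branching has only one possible reconstruction, whereas a branch means the assembler had to choose between alternative paths, which is where chimeric contigs can arise.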

Genome-wide investigation of heat shock transcription factor family in wheat (Triticum aestivum L.)

10.21203/rs.2.9414/v1 ◽

2019 ◽

Author(s):

Jiali Ye ◽

Xuetong Yang ◽

Sha Li ◽

Wei Li ◽

Qi Liu ◽

...

Keyword(s):

Heat Shock ◽

Reference Genome ◽

Phylogenetic Analyses ◽

Triticum Aestivum L ◽

Growth Stages ◽

Rna Seq ◽

Male Sterile ◽

Heat Shock Transcription Factors ◽

Genome Wide ◽

Resistance To Heat

Abstract Background: Heat shock transcription factors (HSFs) play crucial roles in resisting heat stress and regulating plant development. Investigating the HSF family is essential for understanding the fertility conversion mechanism in thermo-sensitive male sterile wheat. Previous studies have investigated the HSF family in wheat but it is necessary to conduct more in-depth and systematic analyses based on the newly published reference genome. Results: In the present study, 61 wheat Hsf (TaHsf) genes were identified using two main strategies and renamed based on their physical locations on chromosomes. According to the gene structure and phylogenetic analyses, the 61 TaHsf genes were classified into three categories and eleven subclasses. The genes were unequally distributed on 21 chromosomes, including two pairs of tandem duplication genes and 52 TaHsf segmental duplication genes. According to the cis-elements identified, most of the TaHsfs can be activated by Ca++ and MYB, and they respond to drought, light, copper, and other stresses as well as heat shock. RNA-seq analysis indicated that the A2 class TaHsf genes exhibited persistently upregulated expression levels in the leaves/shoots, roots (except in the vegetative growth and reproductive growth stages), spikes, and grains in wheat under normal conditions. The A and B class TaHsf genes were positively regulated during the resistance to heat, whereas the C class genes were involved in drought regulation in wheat. Only the A and B class TaHsf genes were upregulated under fertile conditions in thermo-sensitive male sterile wheat. Conclusion: In this study, 61 wheat Hsf genes were identified based on the complete wheat reference genome. This comprehensive analysis provides novel insights into the TaHsf genes, including their diverse functions and involvement in metabolic pathways.

De novo Assembly of the Brugia malayi Genome Using Long Reads from a Single MinION Flowcell

Scientific Reports ◽

10.1038/s41598-019-55908-y ◽

2019 ◽

Vol 9 (1) ◽

Cited By ~ 3

Author(s):

Joseph R. Fauver ◽

John Martin ◽

Gary J. Weil ◽

Makedonka Mitreva ◽

Peter U. Fischer

Keyword(s):

Single Molecule ◽

New Technologies ◽

Reference Genome ◽

De Novo ◽

Complete Mitochondrial Genome ◽

Nuclear Genome ◽

Brugia Malayi ◽

Field Isolates ◽

Sequencing Technologies ◽

Long Reads

Abstract Filarial nematode infections cause a substantial global disease burden. Genomic studies of filarial worms can improve our understanding of their biology and epidemiology. However, genomic information from field isolates is limited and available reference genomes are often discontinuous. Single molecule sequencing technologies can reduce the cost of genome sequencing and long reads produced from these devices can improve the contiguity and completeness of genome assemblies. In addition, these new technologies can make generation and analysis of large numbers of field isolates feasible. In this study, we assessed the performance of the Oxford Nanopore Technologies MinION for sequencing and assembling the genome of Brugia malayi, a human parasite widely used in filariasis research. Using data from a single MinION flowcell, a 90.3 Mb nuclear genome was assembled into 202 contigs with an N50 of 2.4 Mb. This assembly covered 96.9% of the well-defined B. malayi reference genome with 99.2% identity. The complete mitochondrial genome was obtained with individual reads and the nearly complete genome of the endosymbiotic bacteria Wolbachia was assembled alongside the nuclear genome. Long-read data from the MinION produced an assembly that approached the quality of a well-established reference genome using comparatively fewer resources.
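The N50 statistic quoted in this abstract can be sketched as follows: sort contig lengths from longest to shortest and return the length at which the running total first reaches half of the assembly size. The function name `n50` and the example lengths are assumptions for illustration and are not data from the study.

```python
def n50(contig_lengths):
    """Length of the contig at which the cumulative sum of lengths,
    taken from longest to shortest, first reaches half the total."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("no contigs given")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


# Toy contig lengths (made up): total is 54, so half is 27;
# 10 + 9 + 8 = 27 reaches it, giving an N50 of 8.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

A higher N50 indicates that more of the assembly sits in long contigs, which is why it is commonly reported as a contiguity measure.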

Illumina TruSeq synthetic long-reads empower de novo assembly and resolve complex, highly repetitive transposable elements

10.1101/001834 ◽

2014 ◽

Cited By ~ 1

Author(s):

Rajiv C McCoy ◽

Ryan W Taylor ◽

Timothy A Blauwkamp ◽

Joanna L Kelley ◽

Michael Kertesz ◽

...

Keyword(s):

Transposable Elements ◽

Reference Genome ◽

De Novo ◽

Model Organism ◽

Genomic Analysis ◽

High Sequence Identity ◽

Current Reference ◽

Sequencing Technologies ◽

Long Reads ◽

Whole Genomes

High-throughput DNA sequencing technologies have revolutionized genomic analysis, including the de novo assembly of whole genomes. Nevertheless, assembly of complex genomes remains challenging, in part due to the presence of dispersed repeats which introduce ambiguity during genome reconstruction. Transposable elements (TEs) can be particularly problematic, especially for TE families exhibiting high sequence identity, high copy number, or present in complex genomic arrangements. While TEs strongly affect genome function and evolution, most current de novo assembly approaches cannot resolve long, identical, and abundant families of TEs. Here, we applied a novel Illumina technology called TruSeq synthetic long-reads, which are generated through highly parallel library preparation and local assembly of short read data and achieve lengths of 1.5-18.5 Kbp with an extremely low error rate (∼0.03% per base). To test the utility of this technology, we sequenced and assembled the genome of the model organism Drosophila melanogaster (reference genome strain y; cn, bw, sp) achieving an N50 contig size of 69.7 Kbp and covering 96.9% of the euchromatic chromosome arms of the current reference genome. TruSeq synthetic long-read technology enables placement of individual TE copies in their proper genomic locations as well as accurate reconstruction of TE sequences. We entirely recovered and accurately placed 4,229 (77.8%) of the 5,434 annotated transposable elements with perfect identity to the current reference genome. As TEs are ubiquitous features of genomes of many species, TruSeq synthetic long-reads, and likely other methods that generate long reads, offer a powerful approach to improve de novo assemblies of whole genomes.

An improved de novo assembly and annotation of the tomato reference genome using single-molecule sequencing, Hi-C proximity ligation and optical maps

10.1101/767764 ◽

2019 ◽

Cited By ~ 16

Author(s):

Prashant S. Hosmani ◽

Mirella Flores-Gonzalez ◽

Henri van de Geest ◽

Florian Maumus ◽

Linda V. Bakker ◽

...

Keyword(s):

Single Molecule ◽

Reference Genome ◽

De Novo ◽

Proximity Ligation ◽

Contact Maps ◽

Long Reads ◽

Blast Database ◽

Optical Maps ◽

Almost All ◽

454 Sequences

Abstract The original Heinz 1706 reference genome was produced by a large team of scientists from across the globe from a variety of input sources that included 454 sequences in addition to full-length BACs, BAC and fosmid ends sequenced with Sanger technology. We present here the latest tomato reference genome (SL4.0) assembled de novo from PacBio long reads and scaffolded using Hi-C contact maps. The assembly was validated using Bionano optical maps and 10X linked-read sequences. This assembly is highly contiguous with fewer gaps compared to previous genome builds and almost all scaffolds have been anchored and oriented to the 12 tomato chromosomes. We have found more repeats compared to the previous versions, and one of the largest repeat classes identified is the LTR retrotransposons. We also describe updates to the reference genome and annotation since the last publication. The corresponding ITAG4.0 annotation has 4,794 novel genes along with 29,281 genes preserved from ITAG2.4. Most of the updated genes have extensions in the 5’ and 3’ UTRs resulting in doubling of annotated UTRs per gene. The genome and annotation can be accessed using SGN through BLAST database, Pathway database (SolCyc), Apollo, JBrowse genome browser and FTP available at https://solgenomics.net.

The evaluation of RNA-Seq de novo assembly by PacBio long read sequencing

10.1101/735621 ◽

2019 ◽

Author(s):

Yifan Yang ◽

Michael Gribskov

Keyword(s):

Real Time ◽

De Novo ◽

Critical Issue ◽

Evaluation Methods ◽

Model Organisms ◽

Rna Seq ◽

Long Reads ◽

Long Read ◽

Set Up ◽

Downstream Analysis

Abstract RNA-Seq de novo assembly is an important method for generating transcriptomes of non-model organisms before any downstream analysis. Although many de novo assembly methods have been developed, there is still no consensus on how to evaluate them, so establishing a benchmark for the quality of de novo assemblies is critical. Addressing this challenge will deepen our insight into the properties of different de novo assemblers and their evaluation methods, and will provide hints for choosing the best assembly sets as transcriptomes of non-model organisms for further functional analysis. In this article, we generate a “real time” transcriptome using PacBio long reads as a benchmark for evaluating five de novo assemblers and two model-based de novo assembly evaluation methods. By comparing the de novo assemblies generated from RNA-Seq short reads with the “real time” transcriptome from the same biological sample, we find that Trinity is best in completeness, generating more assemblies than the alternative assemblers, but its assemblies are less continuous and contain more misassemblies; Oases is best in continuity and specificity but is less complete; and the performance of SOAPdenovo-Trans, Trans-ABySS and IDBA-Tran is intermediate among the five assemblers. For the evaluation methods, DETONATE leverages multiple aspects of the assembly set and ranks the assembly set with an average performance as the best, while the contig score can serve as a good metric for selecting assemblies with high completeness, specificity, and continuity but is not sensitive to misassemblies; the TransRate contig score is useful for removing misassemblies, yet the assemblies in the optimal set are often too few to be used as a transcriptome.

Ion channel profiling of the Lymnaea stagnalis ganglia via transcriptome analysis

10.21203/rs.3.rs-31358/v1 ◽

2020 ◽

Author(s):

Nan Dong ◽

Julia Bandura ◽

Zhaolei Zhang ◽

Yan Wang ◽

Karine Labadie ◽

...

Keyword(s):

Ion Channels ◽

Reference Genome ◽

Lymnaea Stagnalis ◽

De Novo ◽

Model Organisms ◽

Sequence Length ◽

Pond Snail ◽

Sequence Information ◽

Functional Domain ◽

Rna Seq

Abstract Background. The pond snail Lymnaea stagnalis (L. stagnalis) has been widely used as a model organism in neurobiology, ecotoxicology, and parasitology due to the relative simplicity of its CNS. However, its usefulness is restricted by a limited availability of transcriptome data. While sequence information for the L. stagnalis CNS transcripts has been obtained from an EST library and a de novo RNA-seq assembly, the quality of these assemblies is limited by a combination of low coverage of EST libraries, the fragmented nature of de novo assemblies, and the lack of a reference genome. Results. In this study, taking advantage of the recent availability of the L. stagnalis reference genome, we generated an RNA-seq library from the adult L. stagnalis CNS, using a combination of genome-guided and de novo assembly programs to identify 17,832 protein-coding L. stagnalis transcripts. We combined our library with existing resources to produce a transcript set with greater sequence length, completeness, and diversity than previously available ones. Using our assembly and functional domain analysis, we profiled L. stagnalis CNS transcripts encoding ion channels and ionotropic receptors, which are key proteins for CNS function, and compared their sequences to other vertebrate and invertebrate model organisms. Interestingly, L. stagnalis transcripts encoding numerous putative Ca2+ channels showed the most sequence similarity to those of mouse, zebrafish, Xenopus tropicalis, fruit fly, and C. elegans, suggesting that many calcium channel-related signaling pathways may be evolutionarily conserved. Conclusions. Our study provides the most thorough characterization to date of the L. stagnalis transcriptome and provides insights into differences between vertebrates and invertebrates in CNS transcript diversity, according to function and protein class. Furthermore, this study is, to the best of our knowledge, the first to provide a complete characterization of the ion channels of a single species, opening new avenues for future research on fundamental neurobiological processes.

S-conLSH: alignment-free gapped mapping of noisy long reads

BMC Bioinformatics ◽

10.1186/s12859-020-03918-3 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Angana Chakraborty ◽

Burkhard Morgenstern ◽

Sanghamitra Bandyopadhyay

Keyword(s):

Reference Genome ◽

Genome Mapping ◽

Sequence Data ◽

Downstream Processing ◽

Read Length ◽

Alignment Free ◽

Spaced Seeds ◽

Long Reads ◽

Gc Bias ◽

High Level

Abstract Background: The advancement of SMRT technology has unfolded new opportunities of genome analysis with its longer read length and low GC bias. Alignment of the reads to their appropriate positions in the respective reference genome is the first but costliest step of any analysis pipeline based on SMRT sequencing. However, the state-of-the-art aligners often fail to identify distant homologies due to lack of conserved regions, caused by frequent genetic duplication and recombination. Therefore, we developed a novel alignment-free method of sequence mapping that is fast and accurate. Results: We present a new mapper called S-conLSH that uses Spaced context based Locality Sensitive Hashing. With multiple spaced patterns, S-conLSH facilitates a gapped mapping of noisy long reads to the corresponding target locations of a reference genome. We have examined the performance of the proposed method on 5 different real and simulated datasets. S-conLSH is at least 2 times faster than the recently developed method lordFAST. It achieves a sensitivity of 99%, without using any traditional base-to-base alignment, on human simulated sequence data. By default, S-conLSH provides an alignment-free mapping in PAF format. However, it has an option of generating aligned output as SAM-file, if it is required for any downstream processing. Conclusions: S-conLSH is one of the first alignment-free reference genome mapping tools achieving a high level of sensitivity. The spaced-context is especially suitable for extracting distant similarities. The variable-length spaced-seeds or patterns add flexibility to the proposed algorithm by introducing gapped mapping of the noisy long reads. Therefore, S-conLSH may be considered as a prominent direction towards alignment-free sequence analysis.
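The general idea behind spaced seeds, on which this abstract relies, can be sketched briefly: a binary pattern marks which positions of a window are kept ("1") and which are ignored ("0"), so two sequences that differ only at ignored positions produce the same key. The code below is a generic illustration and not the S-conLSH algorithm itself; the function name `spaced_kmers`, the pattern `1101`, and the sequences are assumptions made for the example.

```python
def spaced_kmers(seq, pattern):
    """Slide a binary pattern over seq and keep only the characters at '1' positions.

    Positions marked '0' are "don't care" sites, so mismatches there
    do not change the extracted key.
    """
    keep = [i for i, flag in enumerate(pattern) if flag == "1"]
    span = len(pattern)
    return [
        "".join(seq[start + i] for i in keep)
        for start in range(len(seq) - span + 1)
    ]


# Toy example (made up): the third window position is ignored by the pattern,
# so "ACGT" and "ACAT" give the same key even though they differ there.
print(spaced_kmers("ACGTAC", "1101"))
print(spaced_kmers("ACGT", "1101") == spaced_kmers("ACAT", "1101"))
```

Keys that tolerate mismatches in this way can then be used as hash-table lookups, which is one reason spaced patterns suit noisy long reads.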
