The Landscapes of Full-Length Transcripts and Splice Isoforms as Well as Transposons Exonization in the Lepidopteran Model System, Bombyx mori

The domesticated silkworm, Bombyx mori, is an important model system for the order Lepidoptera. A chromosome-level genome of Bombyx mori based on third-generation sequencing has been released. However, its transcripts were assembled mainly from second-generation short reads and expressed sequence tags, which cannot accurately capture the transcript profile. Here, we used PacBio Iso-Seq technology to investigate the transcripts from 45 developmental stages of Bombyx mori. We obtained 25,970 non-redundant high-quality consensus isoforms capturing ∼60% of previously reported RNAs, 15,431 (∼47%) novel transcripts, and identified 7,253 long non-coding RNAs (lncRNAs), a large proportion of which (∼56%) are novel. In addition, we found that transposable element (TE) exonization accounts for 11,671 (∼45%) transcripts, including 5,980 protein-coding transcripts (∼32%) and 5,691 lncRNAs (∼79%). Overall, our results expand the set of known silkworm transcripts and have general implications for understanding the interaction between TEs and their host genes. This transcript resource will promote functional studies of genes, lncRNAs, and TEs in the silkworm.

IsoDetect: Detection of splice isoforms from third generation long reads based on short feature sequences

Current Bioinformatics ◽

10.2174/1574893615666200316101205 ◽

2020 ◽

Vol 15 ◽

Author(s):

Hongdong Li ◽

Wenjing Zhang ◽

Yuwen Luo ◽

Jianxin Wang

Keyword(s):

Sequence Similarity ◽

Detection Methods ◽

Sequence Information ◽

Third Generation ◽

Sequencing Data ◽

Splice Isoforms ◽

Third Generation Sequencing ◽

Long Reads ◽

Feature Sequence ◽

Generation Sequencing

Aims: Accurately detect isoforms from third generation sequencing data. Background: Transcriptome annotation is the basis for the analysis of gene expression and regulation. The transcriptome annotation of many organisms, including humans, is far from complete, due partly to the challenge of identifying isoforms that are produced from the same gene through alternative splicing. Third generation sequencing (TGS) reads provide an unprecedented opportunity for detecting isoforms because their length exceeds that of most isoforms. One limitation of current TGS read-based isoform detection methods is that they rely exclusively on sequence reads, without incorporating the sequence information of known isoforms. Objective: Develop an efficient method for isoform detection. Method: Based on annotated isoforms, we propose a splice isoform detection method called IsoDetect. First, the sequence at each exon-exon junction is extracted from annotated isoforms as a “short feature sequence”, which is used to distinguish different splice isoforms. Second, we align these feature sequences to long reads and divide the long reads into groups that contain the same set of feature sequences, thereby avoiding pair-wise comparison among the large number of long reads. Third, clustering and consensus generation are carried out based on sequence similarity. For long reads that do not contain any short feature sequence, clustering analysis based on sequence similarity is performed to identify isoforms. Result: Tested on two datasets from Calypte anna and zebra finch, IsoDetect showed higher speed and compelling accuracy compared with four existing methods. Conclusion: IsoDetect is a promising method for isoform detection. Other: This paper was accepted by the CBC2019 conference.
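The first two steps of the method described above can be illustrated with a minimal Python sketch. It is not the IsoDetect implementation: the function names, the `flank` size, the toy exon sequences, and the use of exact substring search in place of a real aligner are all assumptions made for illustration.

```python
from collections import defaultdict

def junction_features(exons, flank=4):
    """Return one short feature sequence per exon-exon junction.

    `exons` is the ordered list of exon sequences of one annotated isoform.
    Each feature joins the last `flank` bases of an exon with the first
    `flank` bases of the next exon (the flank size is an illustrative choice).
    """
    return [exons[i][-flank:] + exons[i + 1][:flank] for i in range(len(exons) - 1)]

def group_reads_by_features(reads, features):
    """Group long reads by the exact set of feature sequences they contain.

    Exact substring search stands in for the alignment step; real long reads
    are error-prone and would need an error-tolerant aligner.
    """
    groups = defaultdict(list)
    for name, seq in reads.items():
        key = frozenset(f for f in features if f in seq)
        groups[key].append(name)
    return dict(groups)

# Hypothetical gene with three exons and two annotated isoforms (one skips exon 2).
E1, E2, E3 = "AAAACCCC", "GGGGTTTT", "ACGTACGT"
iso_full = [E1, E2, E3]
iso_skip = [E1, E3]
features = sorted(set(junction_features(iso_full) + junction_features(iso_skip)))

reads = {
    "read1": E1 + E2 + E3,
    "read2": E1 + E3,
    "read3": E1 + E2 + E3,
    "read4": "TTTTTTTTTTTT",  # no known junction: left for similarity clustering
}
groups = group_reads_by_features(reads, features)
```

Reads that share the same key can then be clustered and collapsed into a consensus within their group only, which is what avoids the all-against-all comparison; reads under the empty key correspond to the fallback case of reads without any short feature sequence.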

Combined genomic, transcriptomic, and metabolomic analyses provide insights into chayote (Sechium edule) evolution and fruit development

Horticulture Research ◽

10.1038/s41438-021-00487-1 ◽

2021 ◽

Vol 8 (1) ◽

Author(s):

Anzhen Fu ◽

Qing Wang ◽

Jianlou Mu ◽

Lili Ma ◽

Changlong Wen ◽

...

Keyword(s):

Fruit Development ◽

Repetitive Sequences ◽

Genetic Research ◽

Future Research ◽

Agricultural Crop ◽

Protein Coding ◽

Third Generation Sequencing ◽

Sechium Edule ◽

Generation Sequencing ◽

Cucurbitaceae Family

Chayote (Sechium edule) is an agricultural crop in the Cucurbitaceae family that is rich in bioactive components. To enhance genetic research on chayote, we used Nanopore third-generation sequencing combined with Hi–C data to assemble a draft chayote genome. A chromosome-level assembly anchored on 14 chromosomes (N50 contig and scaffold sizes of 8.40 and 46.56 Mb, respectively) estimated the genome size as 606.42 Mb, which is large for the Cucurbitaceae, with 65.94% (401.08 Mb) of the genome comprising repetitive sequences; 28,237 protein-coding genes were predicted. Comparative genome analysis indicated that chayote and snake gourd diverged from sponge gourd and that a whole-genome duplication (WGD) event occurred in chayote at 25 ± 4 Mya. Transcriptional and metabolic analysis revealed genes involved in fruit texture, pigment, flavor, flavonoids, antioxidants, and plant hormones during chayote fruit development. The analysis of the genome, transcriptome, and metabolome provides insights into chayote evolution and lays the groundwork for future research on fruit and tuber development and genetic improvements in chayote.
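For readers unfamiliar with the N50 statistic quoted above, the following minimal Python sketch shows the standard definition: the length of the sequence at which the running total of lengths, sorted from longest to shortest, first reaches half of the assembly size. The contig lengths are hypothetical toy values, not data from the chayote assembly.

```python
def n50(lengths):
    """Return the N50 of a list of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("lengths must not be empty")

toy_contigs = [2, 2, 3, 4, 5, 8, 10]  # hypothetical lengths, total 34
result = n50(toy_contigs)             # 10 + 8 = 18 >= 17, so the N50 is 8
```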

Single-cell RNA-seq analysis of mouse preimplantation embryos by third-generation sequencing

PLoS Biology ◽

10.1371/journal.pbio.3001017 ◽

2020 ◽

Vol 18 (12) ◽

pp. e3001017

Author(s):

Xiaoying Fan ◽

Dong Tang ◽

Yuhan Liao ◽

Pidong Li ◽

Yu Zhang ◽

...

Keyword(s):

Single Cell ◽

Developmental Stages ◽

Expression Patterns ◽

Embryonic Stem ◽

Preimplantation Embryos ◽

Specific Gene ◽

Third Generation ◽

Specific Expression ◽

Third Generation Sequencing ◽

Generation Sequencing

The development of single-cell RNA sequencing (scRNA-seq) techniques based on next generation sequencing (NGS) platforms has tremendously changed biological research, yet many questions cannot be addressed by these techniques because of their short read lengths. We developed a novel scRNA-seq technology based on a third-generation sequencing (TGS) platform (single-cell amplification and sequencing of full-length RNAs by Nanopore platform, SCAN-seq). SCAN-seq exhibited high sensitivity and accuracy comparable to NGS platform-based scRNA-seq methods. Moreover, we captured thousands of unannotated transcripts of diverse types, with a high verification rate by reverse transcription PCR (RT-PCR)–coupled Sanger sequencing in mouse embryonic stem cells (mESCs). Then, we used SCAN-seq to analyze mouse preimplantation embryos. We could clearly distinguish cells at different developmental stages, and a total of 27,250 unannotated transcripts from 9,338 genes were identified, many of which showed developmental stage-specific expression patterns. Finally, we showed that SCAN-seq exhibited high accuracy in determining allele-specific gene expression patterns within an individual cell. SCAN-seq represents a major breakthrough in the field of single-cell transcriptome analysis.

Non-Coding RNA Databases in Cardiovascular Research

Non-Coding RNA ◽

10.3390/ncrna6030035 ◽

2020 ◽

Vol 6 (3) ◽

pp. 35

Author(s):

Deepak Balamurali ◽

Monika Stoll

Keyword(s):

Vascular System ◽

Cardiovascular Research ◽

Data Repositories ◽

Protein Coding ◽

Rna Molecules ◽

High Throughput Data ◽

Non Coding Rna ◽

Non Coding Rnas ◽

Generation Sequencing

Cardiovascular diseases (CVDs) are of multifactorial origin and can be attributed to several genetic and environmental components. CVDs are the leading cause of mortality worldwide and they primarily damage the heart and the vascular system. Non-coding RNA (ncRNA) refers to functional RNA molecules that are transcribed from DNA but are not translated into proteins. Recent transcriptomic studies have identified thousands of ncRNA molecules across species. In humans, less than 2% of the total genome represents protein-coding genes. While the role of many ncRNAs is yet to be ascertained, some long non-coding RNAs (lncRNAs) and microRNAs (miRNAs) have been associated with disease progression, serving as useful diagnostic and prognostic biomarkers. A plethora of data repositories specialized in ncRNAs have been developed over the years using publicly available high-throughput data from next-generation sequencing and other approaches. These repositories cover various facets of ncRNA research, such as basic and functional annotation, expression profiles, structural and molecular changes, and interactions with other biomolecules. Here, we provide a compendium of the current ncRNA databases relevant to cardiovascular research.

Assembly and Analysis of the Complete Mitochondrial Genome of Capsella bursa-pastoris

Plants ◽

10.3390/plants9040469 ◽

2020 ◽

Vol 9 (4) ◽

pp. 469

Author(s):

Denis O. Omelchenko ◽

Maxim S. Makarenko ◽

Artem S. Kasianov ◽

Mikhail I. Schelkunov ◽

Maria D. Logacheva ◽

...

Keyword(s):

Amino Acids ◽

Rna Editing ◽

Complete Mitochondrial Genome ◽

Open Reading Frames ◽

Protein Coding ◽

Third Generation Sequencing ◽

Rnaseq Data ◽

Complete Mitogenome ◽

Generation Sequencing ◽

Reading Frames

Shepherd’s purse (Capsella bursa-pastoris) is a cosmopolitan annual weed and a promising model plant for studying allopolyploidization in the evolution of angiosperms. Though plant mitochondrial genomes are a valuable source of genetic information, they are hard to assemble. At present, the complete mitogenome of C. rubella is the only one available for the genus Capsella. In this work, we have assembled the complete mitogenome of C. bursa-pastoris using high-precision PacBio SMRT third-generation sequencing technology. It is 287,799 bp long and contains 32 protein-coding genes, 3 rRNAs, 25 tRNAs corresponding to 15 amino acids, and 8 open reading frames (ORFs) supported by RNAseq data. Though many repeat regions have been found, none of them is longer than 1 kbp, and the most frequent structural variant originating from these repeats is present in only 4% of the mitogenome copies. The mitochondrial DNA sequence of C. bursa-pastoris differs from that of C. rubella, but not from that of C. orientalis, by two long inversions, suggesting that C. orientalis could be its maternal progenitor species. In total, 377 C to U RNA editing sites have been detected. All genes except cox1 and atp8 contain RNA editing sites, and most of them lead to non-synonymous changes of amino acids. Most of the identified RNA editing sites are identical to the corresponding RNA editing sites in A. thaliana.

Whole-Genome Sequencing and Potassium-Solubilizing Mechanism of Bacillus aryabhattai SK1-7

Frontiers in Microbiology ◽

10.3389/fmicb.2021.722379 ◽

2022 ◽

Vol 12 ◽

Author(s):

Yifan Chen ◽

Hui Yang ◽

Zizhu Shen ◽

Jianren Ye

Keyword(s):

High Performance ◽

Fermentation Broth ◽

Culture Conditions ◽

Whole Genome ◽

Expression Levels ◽

Bacillus Aryabhattai ◽

Third Generation Sequencing ◽

Second Generation Sequencing ◽

Sulfuric Acid Method ◽

Generation Sequencing

This study aimed to analyze the whole genome of Bacillus aryabhattai strain SK1-7 and explore its potassium solubilization characteristics and mechanism, thus providing a theoretical basis for the utilization and improvement of insoluble potassium resources in soil. Genome information for Bacillus aryabhattai SK1-7 was obtained by using Illumina NovaSeq second-generation sequencing and GridION Nanopore ONT third-generation sequencing technology. The contents of organic acids and polysaccharides in the fermentation broth of Bacillus aryabhattai SK1-7 were determined by high-performance liquid chromatography and the anthrone sulfuric acid method, and the expression levels of the potassium solubilization-related genes ackA, epsB, gltA, mdh and ppc were compared by real-time fluorescence quantitative PCR under different potassium source culture conditions. The whole genome of the strain consisted of a complete chromosome sequence and four plasmid sequences. The sequence sizes of the chromosome and plasmids P1, P2, P3 and P4 were 5,188,391 bp, 136,204 bp, 124,862 bp, 67,200 bp and 12,374 bp, respectively. The corresponding GC contents were 38.2, 34.4, 33.6, 32.8, and 33.7%. Strain SK1-7 mainly secreted malic, formic, acetic and citric acids under culture with an insoluble potassium source. The polysaccharide content produced with an insoluble potassium source was higher than that with a soluble potassium source. The expression levels of the five potassium solubilization-related genes with the insoluble potassium source were higher than those with the soluble potassium source.
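As a worked illustration of how the per-replicon figures above combine, the short Python sketch below sums the reported sequence sizes and computes a length-weighted mean GC content. The totals it produces are derived here for illustration and are not values reported in the abstract; the variable names are arbitrary.

```python
# (length in bp, GC content in %) for each replicon, as reported in the abstract
replicons = {
    "chromosome": (5_188_391, 38.2),
    "P1": (136_204, 34.4),
    "P2": (124_862, 33.6),
    "P3": (67_200, 32.8),
    "P4": (12_374, 33.7),
}

# Total genome size is the plain sum of replicon lengths.
total_bp = sum(length for length, _ in replicons.values())

# Overall GC content must be weighted by length; a simple mean of the five
# percentages would overweight the small plasmids.
weighted_gc = sum(length * gc for length, gc in replicons.values()) / total_bp

print(f"total size: {total_bp:,} bp, length-weighted GC: {weighted_gc:.2f}%")
```

Because the chromosome makes up most of the sequence, the weighted value stays close to the chromosomal GC content.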

Linking De Novo Assembly Results with Long DNA Reads Using the dnaasm-link Application

BioMed Research International ◽

10.1155/2019/7847064 ◽

2019 ◽

Vol 2019 ◽

pp. 1-10

Author(s):

Wiktor Kuśmirek ◽

Wiktor Franus ◽

Robert Nowak

Keyword(s):

Dna Sequences ◽

De Novo ◽

Computation Time ◽

Third Generation ◽

Next Generation ◽

Sequencing Data ◽

Third Generation Sequencing ◽

Combining Data ◽

Second Generation Sequencing ◽

Generation Sequencing

Currently, third-generation sequencing techniques, which make it possible to obtain much longer DNA reads than next-generation sequencing technologies, are becoming more and more popular. There are many possibilities for combining data from next-generation and third-generation sequencing. Herein, we present a new application called dnaasm-link for linking contigs, the result of de novo assembly of second-generation sequencing data, with long DNA reads. Our tool includes an integrated module to fill gaps with a suitable fragment of an appropriate long DNA read, which improves the consistency of the resulting DNA sequences. This feature is particularly important for complex DNA regions. Our implementation is found to outperform other state-of-the-art tools in terms of speed and memory requirements, which may enable its use for organisms with a large genome, something that is not possible with existing applications. The presented application has many advantages: (i) it significantly reduces memory usage and computation time; (ii) it fills gaps with an appropriate fragment of a specified long DNA read; (iii) it reduces the number of spanned and unspanned gaps in existing genome drafts. The application is freely available to all users under GNU Library or Lesser General Public License version 3.0 (LGPLv3). The demo application, Docker image, and source code can be downloaded from the project homepage.
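The idea of filling a gap between two contigs with a fragment of a long read can be sketched in a few lines of Python. This is only a toy illustration under strong assumptions (error-free reads, exact-match anchors, a hypothetical `fill_gap` helper and `anchor` length); the abstract does not describe how dnaasm-link itself locates or selects the fragment.

```python
def fill_gap(left_contig, right_contig, long_read, anchor=6):
    """Join two contigs using the long-read fragment that spans the gap.

    The end of the left contig and the start of the right contig are used as
    exact-match anchors in the long read. Returns None when the anchors are
    not found in the expected order.
    """
    left_anchor = left_contig[-anchor:]
    right_anchor = right_contig[:anchor]
    i = long_read.find(left_anchor)
    if i == -1:
        return None
    gap_start = i + anchor
    j = long_read.find(right_anchor, gap_start)
    if j == -1:
        return None
    # Everything between the two anchors in the long read fills the gap.
    return left_contig + long_read[gap_start:j] + right_contig

left = "TTGACCATGCA"
right = "GGATCCTTAGC"
read = "CCATGCA" + "ACGTAC" + "GGATCCTT"  # spans the end of left, a gap, the start of right
joined = fill_gap(left, right, read)
```

With real long reads, which carry sequencing errors, the exact `str.find` lookups would have to be replaced by approximate alignment.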

Evaluating approaches to find exon chains based on long reads

10.1101/066241 ◽

2016 ◽

Author(s):

Anna Kuosmanen ◽

Veli Mäkinen

Keyword(s):

Second Generation ◽

Simulated Data ◽

Error Rates ◽

Third Generation ◽

Sequencing Technologies ◽

Third Generation Sequencing ◽

Long Reads ◽

Long Read ◽

Second Generation Sequencing ◽

Generation Sequencing

Motivation: Transcript prediction can be modelled as a graph problem where exons are modelled as nodes and reads spanning two or more exons are modelled as exon chains. PacBio third-generation sequencing technology produces significantly longer reads than earlier second-generation sequencing technologies, which gives valuable information about longer exon chains in a graph. However, with the high error rates of third-generation sequencing, aligning long reads correctly around the splice sites is a challenging task. Incorrect alignments lead to spurious nodes and arcs in the graph, which in turn lead to incorrect transcript predictions. Results: We survey several approaches to find the exon chains corresponding to long reads in a splicing graph, and experimentally study the performance of these methods using simulated data to allow for sensitivity / precision analysis. Our experiments show that short reads from second-generation sequencing can be used to significantly improve exon chain correctness either by error-correcting the long reads before splicing graph creation, or by using them to create a splicing graph on which the long read alignments are then projected. We also study the memory and time consumption of various modules, and show that accurate exon chains lead to significantly increased transcript prediction accuracy. Availability: The simulated data and in-house scripts used for this article are available at http://cs.helsinki.fi/u/aekuosma/exon_chain_evaluation_publish.tar.gz.
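The graph model described above can be made concrete with a small Python sketch: exons are nodes, each long-read alignment is projected onto the exons it overlaps to give an exon chain, and consecutive exons in a chain become arcs. The coordinates, function names, and the `min_overlap` threshold are hypothetical; none of the surveyed approaches is reproduced here.

```python
def exon_chain(blocks, exons, min_overlap=1):
    """Map the aligned blocks of one long read to an ordered exon chain.

    `blocks` and `exons` are lists of half-open (start, end) intervals on the
    same coordinate axis; `exons` is sorted and doubles as the node list, so
    the chain is a list of exon indices.
    """
    chain = []
    for b_start, b_end in sorted(blocks):
        for idx, (e_start, e_end) in enumerate(exons):
            overlap = min(b_end, e_end) - max(b_start, e_start)
            if overlap >= min_overlap and (not chain or chain[-1] != idx):
                chain.append(idx)
    return chain

def splicing_graph(chains):
    """Count how many exon chains support each arc (u, v) of the graph."""
    arcs = {}
    for chain in chains:
        for u, v in zip(chain, chain[1:]):
            arcs[(u, v)] = arcs.get((u, v), 0) + 1
    return arcs

exons = [(100, 200), (300, 400), (500, 600)]        # hypothetical exon coordinates
read_a = [(110, 200), (300, 400), (500, 580)]       # covers all three exons
read_b = [(150, 200), (500, 600)]                   # skips the middle exon
chains = [exon_chain(read_a, exons), exon_chain(read_b, exons)]
arcs = splicing_graph(chains)
```

A read whose block boundaries are misplaced around a splice site would add a node or arc that no real transcript supports, which is the failure mode the abstract attributes to high long-read error rates.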

Third-generation sequencing and the future of genomics

10.1101/048603 ◽

2016 ◽

Cited By ~ 42

Author(s):

Hayan Lee ◽

James Gurtowski ◽

Shinjae Yoo ◽

Maria Nattestad ◽

Shoshana Marcus ◽

...

Keyword(s):

Single Molecule ◽

Structural Variation ◽

De Novo ◽

Third Generation ◽

Base Pairs ◽

Haplotype Phasing ◽

Third Generation Sequencing ◽

Second Generation Sequencing ◽

High Quality Genome ◽

Generation Sequencing

Third-generation long-range DNA sequencing and mapping technologies are creating a renaissance in high-quality genome sequencing. Unlike second-generation sequencing, which produces short reads a few hundred base-pairs long, third-generation single-molecule technologies generate over 10,000 bp reads or map over 100,000 bp molecules. We analyze how increased read lengths can be used to address longstanding problems in de novo genome assembly, structural variation analysis and haplotype phasing.

Integrate Heterogeneous NGS and TGS Data to Boost Genome-free Transcriptome Research

10.1101/2020.05.27.117796 ◽

2020 ◽

Author(s):

Yangmei Qin ◽

Zhe Lin ◽

Dan Shi ◽

Mindong Zhong ◽

Te An ◽

...

Keyword(s):

De Novo ◽

Transcriptome Assembly ◽

Computational Method ◽

Protein Coding ◽

Third Generation Sequencing ◽

A Genome ◽

Amphioxus Genome ◽

Next Generation Sequencing Ngs ◽

Generation Sequencing

Reliable transcriptomic research under different circumstances of genome availability is a long-standing challenge. Here, we developed a genome-free computational method to aid accurate transcriptome assembly, using the amphioxus as an example. By integrating ten next generation sequencing (NGS) transcriptome datasets and one third-generation sequencing (TGS) dataset, we built a sequence library of non-redundant expressed transcripts for the amphioxus. The library consisted of 91,915 distinct transcripts overall, including 51,549 protein-coding transcripts and 16,923 novel extragenic transcripts. This substantially improved the current amphioxus genome annotation by expanding the distinct gene number from 21,954 to 38,777. We confirmed that the library significantly outperformed the genome, as well as the de novo method, in transcriptome assembly in multiple respects. For convenience, we curated the Integrative Transcript Library database of the amphioxus (http://www.bio-add.org/InTrans/). In summary, this work provides a practical solution for most organisms to alleviate the heavy dependence on a good-quality genome in transcriptome research. It also ensures that amphioxus transcriptome research is grounded on reliable data.
