AStrap: identification of alternative splicing from transcript sequences without a reference genome

2018 ◽  
Vol 35 (15) ◽  
pp. 2654-2656 ◽  
Author(s):  
Guoli Ji ◽  
Wenbin Ye ◽  
Yaru Su ◽  
Moliang Chen ◽  
Guangzao Huang ◽  
...  

Abstract Summary: Alternative splicing (AS) is a well-established mechanism for increasing transcriptome and proteome diversity; however, detecting AS events and distinguishing among AS types in organisms without available reference genomes remains challenging. We developed a de novo approach called AStrap for AS analysis without using a reference genome. AStrap identifies AS events by extensive pair-wise alignments of transcript sequences and predicts AS types by a machine-learning model integrating more than 500 assembled features. We evaluated AStrap using collected AS events from reference genomes of rice and human as well as single-molecule real-time sequencing data from Amborella trichopoda. Results show that AStrap can identify many more AS events with comparable or higher accuracy than the competing method. AStrap also possesses a unique feature of predicting AS types, which achieves an overall accuracy of ∼0.87 for different species. Extensive evaluation of AStrap using different parameters, sample sizes and machine-learning models on different species also demonstrates the robustness and flexibility of AStrap. AStrap could be a valuable addition to the community for the study of AS in non-model organisms with limited genetic resources. Availability and implementation: AStrap is available for download at https://github.com/BMILAB/AStrap. Supplementary information: Supplementary data are available at Bioinformatics online.
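The idea of finding candidate AS events from pair-wise comparisons of transcripts can be illustrated with a toy sketch. This is not AStrap's implementation: it swaps in Python's standard-library difflib as a stand-in for a real sequence aligner, and the function name candidate_gaps, the min_gap threshold and the exon sequences are all hypothetical. It only shows that a long unmatched block flanked by matching sequence in two isoforms of the same gene is the kind of signal such an approach could look for.

```python
from difflib import SequenceMatcher


def candidate_gaps(transcript_a, transcript_b, min_gap=10):
    """Return (tag, start_in_a, length) for each long unmatched block.

    Toy stand-in for a pair-wise alignment: difflib compares the two
    strings character by character; it is not a biological aligner.
    """
    matcher = SequenceMatcher(None, transcript_a, transcript_b, autojunk=False)
    gaps = []
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # matching sequence, shared by both transcripts
        length = max(a2 - a1, b2 - b1)
        if length >= min_gap:
            gaps.append((tag, a1, length))
    return gaps


# Hypothetical exons, invented for illustration only.
EXON1 = "ATGGCGTACCTGAAGGTCAA"
EXON2 = "GTTTCCAGGATTCAG"
EXON3 = "CTGACCGATTGGCATAAGTGA"

long_isoform = EXON1 + EXON2 + EXON3   # retains the middle exon
short_isoform = EXON1 + EXON3          # skips the middle exon

gaps = candidate_gaps(long_isoform, short_isoform)
print(gaps)  # one block present only in the long isoform
```

Classifying the type of each event, which the abstract attributes to a machine-learning model, is outside the scope of this sketch.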

2016 ◽  
Author(s):  
Nadia M Davidson ◽  
Anthony DK Hawkins ◽  
Alicia Oshlack

Abstract Numerous methods have been developed to analyse RNA sequencing data, but most rely on the availability of a reference genome, making them unsuitable for non-model organisms. De novo transcriptome assembly can build a reference transcriptome from the non-model sequencing data, but falls short of allowing most tools to be applied. Here we present superTranscripts, a simple but powerful solution to bridge that gap. SuperTranscripts are a substitute for a reference genome, consisting of all the unique exonic sequence, in transcriptional order, such that each gene is represented by a single sequence. We demonstrate how superTranscripts allow visualization, variant detection and differential isoform detection in non-model organisms, using widely applied methods that are designed to work with reference genomes. SuperTranscripts can also be applied to model organisms to enhance visualization and discover novel expressed sequence. We describe Lace, software to construct superTranscripts from any set of transcripts including de novo assembled transcriptomes. In addition we used Lace to combine reference and assembled transcriptomes for chicken and recovered the sequence of hundreds of gaps in the reference genome.


2022 ◽  
Author(s):  
Karl Johan Westrin ◽  
Warren W Kretzschmar ◽  
Olof Emanuelsson

Motivation: Transcriptome assembly from RNA sequencing data in species without a reliable reference genome has to be performed de novo, but studies have shown that de novo methods often have inadequate reconstruction ability of transcript isoforms. This impedes the study of alternative splicing, in particular for lowly expressed isoforms. Result: We present the de novo transcript isoform assembler ClusTrast, which clusters a set of guiding contigs by similarity, aligns short reads to the guiding contigs, and assembles each clustered set of short reads individually. We tested ClusTrast on datasets from six eukaryotic species, and showed that ClusTrast reconstructed more expressed known isoforms than any of the other tested de novo assemblers, at a moderate reduction in precision. An appreciable fraction were reconstructed to at least 95% of their length. We suggest that ClusTrast will be useful for studying alternative splicing in the absence of a reference genome. Availability and implementation: The code and usage instructions are available at https://github.com/karljohanw/clustrast.


2019 ◽  
Vol 35 (17) ◽  
pp. 3119-3126 ◽  
Author(s):  
Huiyuan Wang ◽  
Huihui Wang ◽  
Hangxiao Zhang ◽  
Sheng Liu ◽  
Yongsheng Wang ◽  
...  

Abstract Motivation: MicroRNA (miRNA) and alternative splicing (AS)-mediated post-transcriptional regulation has been extensively studied in most eukaryotes. However, the interplay between AS and miRNAs has not been explored in plants. To our knowledge, the overall profile of miRNA target sites in circular RNAs (circRNA) generated by alternative back splicing has never been reported previously. To address the challenge, we identified miRNA target sites located in alternatively spliced regions of the linear and circular splice isoforms using the up-to-date single-molecule real-time (SMRT) isoform sequencing (Iso-Seq) and Illumina sequencing data in eleven plant species. Results: In total, we identified 399 401 and 114 574 AS events from linear and circular RNAs, respectively. Among them, there were 64 781 and 41 146 miRNA target sites located in linear and circular AS regions, respectively. In addition, we found 38 913 circRNAs to be overlapping with 45 648 AS events of their own parent isoforms, suggesting circRNA regulation of AS of linear RNAs by forming R-loops with the genomic locus. Here, we present a comprehensive database of miRNA targets in alternatively spliced linear and circRNAs (ASmiR) and a web server for deposition and identification of miRNA target sites located in the alternatively spliced region of linear and circular RNAs. This database is accompanied by an easy-to-use web query interface for meaningful downstream analysis. The plant research community can submit user-defined datasets to the web service to search AS regions harboring small RNA target sites. In conclusion, this study provides an unprecedented resource to understand regulatory relationships between miRNAs and AS in both gymnosperms and angiosperms. Availability and implementation: The readily accessible database and web-based tools are available at http://forestry.fafu.edu.cn/bioinfor/db/ASmiR. Supplementary information: Supplementary data are available at Bioinformatics online.


GigaScience ◽  
2019 ◽  
Vol 8 (12) ◽  
Author(s):  
Hui-Su Kim ◽  
Sungwon Jeon ◽  
Changjae Kim ◽  
Yeon Kyung Kim ◽  
Yun Sung Cho ◽  
...  

Abstract Background: Long DNA reads produced by single-molecule and pore-based sequencers are more suitable for assembly and structural variation discovery than short-read DNA fragments. For de novo assembly, Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) are the favorite options. However, PacBio's SMRT sequencing is expensive for a full human genome assembly and costs more than $40,000 US for 30× coverage as of 2019. ONT PromethION sequencing, on the other hand, is 1/12 the price of PacBio for the same coverage. This study aimed to compare the cost-effectiveness of ONT PromethION and PacBio's SMRT sequencing in relation to the quality. Findings: We performed whole-genome de novo assemblies and comparison to construct an improved version of KOREF, the Korean reference genome, using sequencing data produced by PromethION and PacBio. With PromethION, an assembly using sequenced reads with 64× coverage (193 Gb, 3 flowcell sequencing) resulted in 3,725 contigs with N50s of 16.7 Mb and a total genome length of 2.8 Gb. It was comparable to a KOREF assembly constructed using PacBio at 62× coverage (188 Gb, 2,695 contigs, and N50s of 17.9 Mb). When we applied Hi-C–derived long-range mapping data, an even higher quality assembly for the 64× coverage was achieved, resulting in 3,179 scaffolds with an N50 of 56.4 Mb. Conclusion: The pore-based PromethION approach provided a high-quality chromosome-scale human genome assembly at a low cost with long maximum contig and scaffold lengths and was more cost-effective than PacBio at comparable quality measurements.
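The N50 values quoted above follow the usual definition of the statistic: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. A minimal sketch of that definition follows; the function name n50 and the contig lengths are illustrative and are not taken from the KOREF assemblies.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must contain at least one positive value")


# Illustrative contig lengths (arbitrary units), not real assembly data.
toy_contigs = [2, 2, 2, 3, 3, 4, 8, 8]
toy_n50 = n50(toy_contigs)
print(toy_n50)
```

Because N50 depends only on the length distribution, two assemblies with similar N50 can still differ in contig count and correctness, which is why the abstract reports contig numbers alongside it.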


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Hai-Feng Tian ◽  
Qiao-Mu Hu ◽  
Zhong Li

Abstract The swamp eel (Monopterus albus) is an economically important fish in China and South-Eastern Asia and a good model species to study sex inversion. There are different genetic lineages and multiple local strains of swamp eel in China, and one local strain of M. albus with deep yellow and big spots has been selected for consecutive selective breeding due to superiority in growth rate and fecundity. A high-quality reference genome of the swamp eel would be a very useful resource for future selective breeding programs. In the present study, we applied PacBio single-molecule sequencing technique (SMRT) and the high-throughput chromosome conformation capture (Hi-C) technologies to assemble the M. albus genome. A 799 Mb genome was obtained with the contig N50 length of 2.4 Mb and scaffold N50 length of 67.24 Mb, indicating 110-fold and ∼31.87-fold improvement compared to the earlier released assembly (∼22.24 Kb and 2.11 Mb, respectively). Aided with Hi-C data, a total of 750 contigs were reliably assembled into 12 chromosomes. Using 22,373 protein-coding genes annotated here, the phylogenetic relationships of the swamp eel with other teleosts showed that swamp eel separated from the common ancestor of Zig-zag eel ∼49.9 million years ago, and 769 gene families were found expanded, which are mainly enriched in the immune system, sensory system, and transport and catabolism. This highly accurate, chromosome-level reference genome of M. albus obtained in this work will be used for the development of genome-scale selective breeding.
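The fold improvements in this abstract are ratios of the new N50 to the earlier one. The short check below recomputes them from the rounded figures printed above; the function name fold_improvement is illustrative. With these rounded inputs the contig ratio comes out near 108, so the reported 110-fold value presumably reflects unrounded lengths or rounding of the result, which is an assumption rather than something the abstract states.

```python
def fold_improvement(new_length_bp, old_length_bp):
    """Return how many times larger the new N50 is than the old one."""
    return new_length_bp / old_length_bp


# Values as printed in the abstract, converted to base pairs.
contig_fold = fold_improvement(2.4e6, 22.24e3)     # 2.4 Mb vs ~22.24 Kb
scaffold_fold = fold_improvement(67.24e6, 2.11e6)  # 67.24 Mb vs 2.11 Mb

print(round(contig_fold), round(scaffold_fold, 2))
```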


2020 ◽  
Author(s):  
Adrian Casanova ◽  
Francesco Maroso ◽  
Andrés Blanco ◽  
Miguel Hermida ◽  
Nestor Rios ◽  
...  

Abstract Background: The emergence of next-generation sequencing (NGS) and restriction site-associated DNA sequencing (RAD-seq) in the last decade has led to the identification of thousands of molecular markers and their genotyping for refined genomic screening. This approach has been especially useful for non-model organisms with limited genomic resources. Many building-loci pipelines have been developed to obtain robust single nucleotide polymorphism (SNP) genotyping datasets using a de novo RAD-seq approach, i.e. without reference genomes. Here, the performances of two building-loci pipelines, STACKS 2 and Meyer’s 2b-RAD v2.1 pipeline, were compared using a diverse set of aquatic species representing different genomic and/or population structure scenarios. Two bivalve species (Manila clam and common edible cockle) and three fish species (brown trout, silver catfish and small-spotted catshark) were studied. Four SNP panels were evaluated in each species to test both different building-loci pipelines and criteria for SNP selection. Furthermore, for Manila clam and brown trout, a reference genome approach was used as control. Results: Although different outcomes were observed between pipelines and species with the diverse SNP calling and filtering steps tested, no remarkable differences were found in genetic diversity and differentiation within species with the SNP panels obtained with a de novo approach. The main differences were found in brown trout between the de novo and reference genome approaches. Genotyped vs missing data mismatches were the main genotyping difference detected between the two building-loci pipelines or between the de novo and reference genome comparisons. Conclusions: Tested building-loci pipelines seem not to have a substantial influence on population genetics inference. Preliminary trials with bioinformatic pipelines are suggested to evaluate their influence in population parameters related to the specific goals of the study.


2019 ◽  
Author(s):  
Hui-Su Kim ◽  
Sungwon Jeon ◽  
Changjae Kim ◽  
Yeon Kyung Kim ◽  
Yun Sung Cho ◽  
...  

Abstract Background: Long DNA reads produced by single molecule and pore-based sequencers are more suitable for assembly and structural variation discovery than short read DNA fragments. For de novo assembly, PacBio and Oxford Nanopore Technologies (ONT) are favorite options. However, PacBio’s SMRT sequencing is expensive for a full human genome assembly and costs over 40,000 USD for 30x coverage as of 2019. ONT PromethION sequencing, on the other hand, is one-twelfth the price of PacBio for the same coverage. This study aimed to compare the cost-effectiveness of ONT PromethION and PacBio’s SMRT sequencing in relation to the quality. Findings: We performed whole genome de novo assemblies and comparison to construct an improved version of KOREF, the Korean reference genome, using sequencing data produced by PromethION and PacBio. With PromethION, an assembly using sequenced reads with 64x coverage (193 Gb, 3 flowcell sequencing) resulted in 3,725 contigs with N50s of 16.7 Mbp and a total genome length of 2.8 Gbp. It was comparable to a KOREF assembly constructed using PacBio at 62x coverage (188 Gbp, 2,695 contigs and N50s of 17.9 Mbp). When we applied Hi-C-derived long-range mapping data, an even higher quality assembly for the 64x coverage was achieved, resulting in 3,179 scaffolds with an N50 of 56.4 Mbp. Conclusion: The pore-based PromethION approach provides a good quality chromosome-scale human genome assembly at a low cost with long maximum contig and scaffold lengths and is more cost-effective than PacBio at comparable quality measurements.


Animals ◽  
2021 ◽  
Vol 11 (8) ◽  
pp. 2226
Author(s):  
Sazia Kunvar ◽  
Sylwia Czarnomska ◽  
Cino Pertoldi ◽  
Małgorzata Tokarska

The European bison is a non-model organism; thus, most of its genetic and genomic analyses have been performed using cattle-specific resources, such as BovineSNP50 BeadChip or Illumina Bovine 800 K HD Bead Chip. The problem with non-specific tools is the potential loss of evolutionarily diversified information (ascertainment bias) and species-specific markers. Here, we have used a genotyping-by-sequencing (GBS) approach for genotyping 256 samples from the European bison population in Bialowieza Forest (Poland) and performed an analysis using two integrated pipelines of the STACKS software: one is de novo (without reference genome) and the other is a reference pipeline (with reference genome). Moreover, we used a reference pipeline with two different genomes, i.e., Bos taurus and European bison. GBS is a useful tool for SNP genotyping in non-model organisms due to its cost effectiveness. Our results support GBS with a reference pipeline without PCR duplicates as a powerful approach for studying the population structure and genotyping data of non-model organisms. We found more polymorphic markers in the reference pipeline in comparison to the de novo pipeline. The decreased number of SNPs from the de novo pipeline could be due to the extremely low level of heterozygosity in European bison. All SNPs obtained with the de novo/Bos taurus and Bos taurus reference pipelines were confirmed to be unique and not included in the 800 K BovineHD BeadChip.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Nae-Chyun Chen ◽  
Brad Solomon ◽  
Taher Mun ◽  
Sheila Iyer ◽  
Ben Langmead

Abstract Most sequencing data analyses start by aligning sequencing reads to a linear reference genome, but failure to account for genetic variation leads to reference bias and confounding of results downstream. Other approaches replace the linear reference with structures like graphs that can include genetic variation, incurring major computational overhead. We propose the reference flow alignment method that uses multiple population reference genomes to improve alignment accuracy and reduce reference bias. Compared to the graph aligner vg, reference flow achieves a similar level of accuracy and bias avoidance but with 14% of the memory footprint and 5.5 times the speed.


2021 ◽  
Author(s):  
Myung-Shin Kim ◽  
Taeyoung Lee ◽  
Jeonghun Baek ◽  
Ji Hong Kim ◽  
Changhoon Kim ◽  
...  

Abstract Massive resequencing efforts have been undertaken to catalog allelic variants in major crop species including soybean, but the scope of the information for genetic variation often depends on short sequence reads mapped to the extant reference genome. Additional de novo assembled genome sequences provide a unique opportunity to explore a dispensable genome fraction in the pan-genome of a species. Here, we report the de novo assembly and annotation of Hwangkeum, a popular soybean cultivar in Korea. The assembly was constructed using PromethION nanopore sequencing data and two genetic maps, and was then error-corrected using Illumina short-reads and PacBio SMRT reads. The 933.12 Mb assembly was annotated with 79,870 transcripts for 58,550 genes using RNA-Seq data and the public soybean annotation set. Comparison of the Hwangkeum assembly with the Williams 82 soybean reference genome sequence revealed 1.8 million single-nucleotide polymorphisms, 0.5 million indels, and 25 thousand putative structural variants. However, there was no natural megabase-scale chromosomal rearrangement. Incidentally, by adding two novel groups, we found that soybean contains four clearly separated groups of centromeric satellite repeats. Analyses of satellite repeats and gene content suggested that the Hwangkeum assembly is a high-quality assembly. This was further supported by comparison of the marker arrangement of anthocyanin biosynthesis genes and of gene arrangement at the Rsv3 locus. Therefore, the results indicate that the de novo assembly of Hwangkeum is a valuable additional reference genome resource for characterizing traits for the improvement of this important crop species.

