ontoFAST: An R package for interactive and semi-automatic annotation of characters with biological ontologies


10.1101/2021.05.11.443562 ◽

2021 ◽

Author(s):

Sergei Tarasov ◽

István Mikó ◽

Matthew Jon Yoder

Keyword(s):

R Package ◽

Automatic Annotation ◽

Data Interoperability ◽

Biological Ontologies ◽

Interactive Interface ◽

Phylogenetic Methods ◽

Helper Function ◽

Downstream Analysis ◽

Further Development

The commonly used Entity-Quality (EQ) syntax provides rich semantics and high granularity for annotating phenotypes and characters using ontologies. However, EQ syntax can be time-inefficient when this granularity is unnecessary for downstream analysis. We present ontoFAST, an R package that aids the fast annotation of characters and character matrices with biological ontologies. Its interactive interface allows quick and convenient tagging of character statements with the necessary ontology terms. The annotations produced in ontoFAST can be exported in csv format for downstream analysis. Additionally, ontoFAST provides (i) functions for constructing simple queries of characters against ontologies, and (ii) helper functions for exporting and visualising complex ontological hierarchies and their relationships. ontoFAST enhances data interoperability between various applications and supports further integration of ontological and phylogenetic methods. Ontology tools are underrepresented in the R environment, and we hope that ontoFAST will stimulate their further development.
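
The idea of querying annotated characters against an ontology can be illustrated with a minimal sketch. This is not ontoFAST's API (ontoFAST is an R package); the term IDs, character IDs, and the function names `ancestors` and `characters_under` are hypothetical and exist only for this illustration. The sketch assumes the ontology is given as child-to-parent links and that each character has been tagged with one or more terms.

```python
# Hypothetical toy ontology: child term -> list of parent terms.
PARENTS = {
    "ONT:head": ["ONT:body"],
    "ONT:thorax": ["ONT:body"],
    "ONT:antenna": ["ONT:head"],
    "ONT:mandible": ["ONT:head"],
    "ONT:wing": ["ONT:thorax"],
}

# Hypothetical annotations: character ID -> ontology terms tagged to it.
ANNOTATIONS = {
    "CHAR:1": ["ONT:antenna"],
    "CHAR:2": ["ONT:wing"],
    "CHAR:3": ["ONT:mandible", "ONT:head"],
}


def ancestors(term, parents):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def characters_under(query_term, annotations, parents):
    """List characters tagged with `query_term` or with any of its descendants."""
    hits = []
    for char_id, terms in annotations.items():
        if any(t == query_term or query_term in ancestors(t, parents) for t in terms):
            hits.append(char_id)
    return sorted(hits)
```

Under these assumptions, querying for the head term returns the characters tagged with the head itself or with its parts (antenna, mandible), which is the kind of simple character-against-ontology query the abstract describes.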


ontoFAST: An R package for interactive and semi‐automatic annotation of characters with biological ontologies

Methods in Ecology and Evolution ◽

10.1111/2041-210x.13753 ◽

2021 ◽

Author(s):

Sergei Tarasov ◽

István Mikó ◽

Matthew Jon Yoder

Keyword(s):

R Package ◽

Automatic Annotation ◽

Biological Ontologies


BloodGen3Module: Blood transcriptional module repertoire analysis and visualization using R

Bioinformatics ◽

10.1093/bioinformatics/btab121 ◽

2021 ◽

Author(s):

Darawan Rinchai ◽

Jessica Roelands ◽

Mohammed Toufiq ◽

Wouter Hendrickx ◽

Matthew C Altman ◽

...

Keyword(s):

Transcript Abundance ◽

R Package ◽

Supplementary Information ◽

Illustrative Case ◽

Bioinformatic Tools ◽

Transcriptional Module ◽

Wide Range ◽

Downstream Analysis ◽

Computing Module ◽

Parallel Workflow

Abstract Motivation: We previously described the construction and characterization of generic and reusable blood transcriptional module repertoires. More recently we released a third iteration (“BloodGen3” module repertoire) that comprises 382 functionally annotated gene sets (modules) and encompasses 14,168 transcripts. Custom bioinformatic tools are needed to support downstream analysis, visualization and interpretation relying on such fixed module repertoires. Results: We have developed and describe here an R package, BloodGen3Module. The functions of our package permit group comparison analyses to be performed at the module level and the results to be displayed as annotated fingerprint grid plots. A parallel workflow for computing module repertoire changes for individual samples rather than groups of samples is also available; these results are displayed as fingerprint heatmaps. An illustrative case is used to demonstrate the steps involved in generating blood transcriptome repertoire fingerprints of septic patients. Taken together, this resource could facilitate the analysis and interpretation of changes in blood transcript abundance observed across a wide range of pathological and physiological states. Availability: The BloodGen3Module package and documentation are freely available from GitHub: https://github.com/Drinchai/BloodGen3Module Supplementary information: Supplementary data are available at Bioinformatics online.


TranscriptomeReconstructoR, A Data-Driven Annotation of Complex Transcriptomes

10.21203/rs.3.rs-131404/v1 ◽

2020 ◽

Author(s):

Maxim Ivanov ◽

Albin Sandelin ◽

Sebastian Marquardt

Keyword(s):

De Novo ◽

Gene Annotation ◽

R Package ◽

Sequence Information ◽

RNA Seq ◽

Sequencing Data ◽

Gene Model ◽

Preparation Methods ◽

Downstream Analysis

Abstract Background: The quality of gene annotation determines the interpretation of results obtained in transcriptomic studies. The growing amount of genome sequence information calls for experimental and computational pipelines for de novo transcriptome annotation. Ideally, gene and transcript models should be called from a limited set of key experimental data. Results: We developed TranscriptomeReconstructoR, an R package which implements a pipeline for automated transcriptome annotation. It relies on integrating features from independent and complementary datasets: i) full-length RNA-seq for detection of splicing patterns and ii) high-throughput 5' and 3' tag sequencing data for accurate definition of gene borders. The pipeline can also take a nascent RNA-seq dataset to supplement the called gene model with transient transcripts. We reconstructed de novo the transcriptional landscape of wild type Arabidopsis thaliana seedlings as a proof-of-principle. A comparison to the existing transcriptome annotations revealed that our gene model is more accurate and comprehensive than the two most commonly used community gene models, TAIR10 and Araport11. In particular, we identify thousands of transient transcripts missing from the existing annotations. Our new annotation promises to improve the quality of A. thaliana genome research. Conclusions: Our proof-of-concept data suggest a cost-efficient strategy for rapid and accurate annotation of complex eukaryotic transcriptomes. We combine the choice of library preparation methods and sequencing platforms with the dedicated computational pipeline implemented in the TranscriptomeReconstructoR package. The pipeline only requires prior knowledge of the reference genomic DNA sequence, but not the transcriptome. The package seamlessly integrates with Bioconductor packages for downstream analysis.


Treeio: An R Package for Phylogenetic Tree Input and Output with Richly Annotated and Associated Data

Molecular Biology and Evolution ◽

10.1093/molbev/msz240 ◽

2019 ◽

Vol 37 (2) ◽

pp. 599-603 ◽

Cited By ~ 25

Author(s):

Li-Gen Wang ◽

Tommy Tsan-Yuk Lam ◽

Shuangbin Xu ◽

Zehan Dai ◽

Lang Zhou ◽

...

Keyword(s):

Phylogenetic Tree ◽

Phylogenetic Trees ◽

R Package ◽

External Data ◽

Input And Output ◽

Evolutionary Context ◽

Tree Data ◽

Downstream Analysis ◽

Different Sources ◽

Associated Data

Abstract Phylogenetic trees and data are often stored in incompatible and inconsistent formats. The outputs of software tools that contain trees with analysis findings are often not compatible with each other, making it hard to integrate the results of different analyses in a comparative study. The treeio package is designed to connect phylogenetic tree input and output. It supports extracting phylogenetic trees as well as the outputs of commonly used analytical software. It can link external data to phylogenies and merge tree data obtained from different sources, enabling analyses of phylogeny-associated data from different disciplines in an evolutionary context. Treeio also supports export of a phylogenetic tree with heterogeneous-associated data to a single tree file, including BEAST compatible NEXUS and jtree formats; these facilitate data sharing as well as file format conversion for downstream analysis. The treeio package is designed to work with the tidytree and ggtree packages. Tree data can be processed using the tidy interface with tidytree and visualized by ggtree. The treeio package is released within the Bioconductor and rOpenSci projects. It is available at https://www.bioconductor.org/packages/treeio/.


PTMscape: an open source tool to predict generic post-translational modifications and map hotspots of modification crosstalk

10.1101/257386 ◽

2018 ◽

Author(s):

Ginny X.H. Li ◽

Christine Vogel ◽

Hyungwon Choi

Keyword(s):

R Package ◽

RNA Recognition ◽

Functional Protein ◽

RNA Recognition Motifs ◽

Post Translational Modifications ◽

Physico Chemical ◽

Chemical Microenvironment ◽

Recognition Motifs ◽

Downstream Analysis ◽

Regulation Of Cell Cycle

Abstract While tandem mass spectrometry can now detect post-translational modifications (PTM) at the proteome scale, reported modification sites are often incomplete and include false positives. Computational approaches can complement these datasets with additional predictions, but most available tools are tailored for single modifications and each tool uses different features for prediction. We developed an R package called PTMscape which predicts modification sites across the proteome based on a unified and comprehensive set of descriptors of the physico-chemical microenvironment of modified sites, with additional downstream analysis modules to test enrichment of individual or pairs of modifications in functional protein regions. PTMscape is generic in its ability to process any major modification, such as phosphorylation and ubiquitination, while achieving sensitivity and specificity comparable to single-PTM methods and outperforming other multi-PTM tools. Maintaining generalizability of the framework, we expanded proteome-wide coverage of five major modifications affecting different residues by prediction and performed combinatorial analysis for spatial co-occurrence of pairs of those modifications. This analysis revealed potential modification hotspots and crosstalk among multiple PTMs in key protein domains such as histone, protein kinase, and RNA recognition motifs, spanning various biological processes such as RNA processing, DNA damage response, signal transduction, and regulation of cell cycle. These results provide a proteome-scale analysis of crosstalk among major PTMs and can be easily extended to other modifications. Contact: All correspondence should be addressed to [email protected].


Identification, annotation and visualisation of extreme changes in splicing from RNA-seq experiments with SwitchSeq

10.1101/005967 ◽

2014 ◽

Cited By ~ 6

Author(s):

Mar Gonzàlez-Porta ◽

Alvis Brazma

Keyword(s):

Enrichment Analysis ◽

R Package ◽

Third Party ◽

Pathway Enrichment Analysis ◽

RNA Seq ◽

Differential Splicing ◽

Protein Coding ◽

Downstream Analysis ◽

Intuitive Manner ◽

Abundant Transcript

In recent years, RNA sequencing has become the method of choice for the study of transcriptome composition. When working with this type of data, several tools exist to quantify differences in splicing across conditions and to assess the significance of those changes. However, the number of genes predicted to undergo differential splicing is often high, and further interpretation of the results becomes a challenging task. Here we present SwitchSeq, a novel set of tools designed to help users interpret differential splicing events that affect protein coding genes. More specifically, we provide a framework to identify switch events, i.e., cases where, for a given gene, the identity of the most abundant transcript changes across conditions. The identified events are then annotated by incorporating information from several public databases and third-party tools, and are further visualised in an intuitive manner with the independent R package tviz. All the results are displayed in a self-contained HTML document, and are also stored in txt and json format to facilitate integration with any further downstream analysis tools. Such an analysis approach can be used complementarily to Gene Ontology and pathway enrichment analysis, and can also serve as an aid in the validation of predicted changes in mRNA and protein abundance. The latest version of SwitchSeq, including installation instructions and use cases, can be found at https://github.com/mgonzalezporta/SwitchSeq. Additionally, the plot capabilities are provided as an independent R package at https://github.com/mgonzalezporta/tviz.
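
The definition of a switch event given above (the most abundant transcript of a gene differs between conditions) can be sketched in a few lines. This is an illustrative sketch only, not SwitchSeq's implementation: the function names, the input layout (gene to transcript to expression value), and the choice to skip genes with no expression in either condition are all assumptions made here.

```python
def dominant_transcript(expression):
    """Return the most abundant transcript, or None if nothing is expressed.

    `expression` maps transcript ID -> expression value. Ties are resolved in
    favour of the first transcript encountered (an arbitrary choice here).
    """
    if not expression or max(expression.values()) <= 0:
        return None
    return max(expression, key=expression.get)


def find_switch_events(cond_a, cond_b):
    """Map each gene to (dominant in A, dominant in B) where the two differ.

    `cond_a` and `cond_b` map gene ID -> {transcript ID -> expression value}.
    Genes missing from, or silent in, either condition are skipped.
    """
    switches = {}
    for gene in cond_a.keys() & cond_b.keys():
        top_a = dominant_transcript(cond_a[gene])
        top_b = dominant_transcript(cond_b[gene])
        if top_a is not None and top_b is not None and top_a != top_b:
            switches[gene] = (top_a, top_b)
    return switches
```

A real analysis would typically also filter on minimum expression and on how strongly the dominant transcript outweighs the others; those thresholds are left out because the abstract does not specify them.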


TranscriptomeReconstructoR: data-driven annotation of complex transcriptomes

10.1101/2020.12.10.418897 ◽

2020 ◽

Author(s):

Maxim Ivanov ◽

Albin Sandelin ◽

Sebastian Marquardt

Keyword(s):

De Novo ◽

Gene Annotation ◽

R Package ◽

Sequence Information ◽

RNA Seq ◽

Sequencing Data ◽

Gene Model ◽

Preparation Methods ◽

Downstream Analysis

Abstract Background: The quality of gene annotation determines the interpretation of results obtained in transcriptomic studies. The growing amount of genome sequence information calls for experimental and computational pipelines for de novo transcriptome annotation. Ideally, gene and transcript models should be called from a limited set of key experimental data. Results: We developed TranscriptomeReconstructoR, an R package which implements a pipeline for automated transcriptome annotation. It relies on integrating features from independent and complementary datasets: i) full-length RNA-seq for detection of splicing patterns and ii) high-throughput 5’ and 3’ tag sequencing data for accurate definition of gene borders. The pipeline can also take a nascent RNA-seq dataset to supplement the called gene model with transient transcripts. We reconstructed de novo the transcriptional landscape of wild type Arabidopsis thaliana seedlings as a proof-of-principle. A comparison to the existing transcriptome annotations revealed that our gene model is more accurate and comprehensive than the two most commonly used community gene models, TAIR10 and Araport11. In particular, we identify thousands of transient transcripts missing from the existing annotations. Our new annotation promises to improve the quality of A. thaliana genome research. Conclusions: Our proof-of-concept data suggest a cost-efficient strategy for rapid and accurate annotation of complex eukaryotic transcriptomes. We combine the choice of library preparation methods and sequencing platforms with the dedicated computational pipeline implemented in the TranscriptomeReconstructoR package. The pipeline only requires prior knowledge of the reference genomic DNA sequence, but not the transcriptome. The package seamlessly integrates with Bioconductor packages for downstream analysis.


EnImpute: imputing dropout events in single-cell RNA-sequencing data via ensemble learning

Bioinformatics ◽

10.1093/bioinformatics/btz435 ◽

2019 ◽

Vol 35 (22) ◽

pp. 4827-4829 ◽

Cited By ~ 6

Author(s):

Xiao-Fei Zhang ◽

Le Ou-Yang ◽

Shuo Yang ◽

Xing-Ming Zhao ◽

Xiaohua Hu ◽

...

Keyword(s):

Single Cell ◽

RNA Sequencing ◽

Ensemble Learning ◽

R Package ◽

Supplementary Information ◽

Sequencing Data ◽

Single Cell RNA Sequencing ◽

The Individual ◽

Downstream Analysis ◽

Shiny Application

Abstract Summary: Imputation of dropout events that may mislead downstream analyses is a key step in analyzing single-cell RNA-sequencing (scRNA-seq) data. We develop EnImpute, an R package that introduces an ensemble learning method for imputing dropout events in scRNA-seq data. EnImpute combines the results obtained from multiple imputation methods to generate a more accurate result. A Shiny application is developed to provide easier implementation and visualization. Experimental results show that EnImpute outperforms the individual state-of-the-art methods in almost all situations. EnImpute is useful for correcting noisy scRNA-seq data before performing downstream analysis. Availability and implementation: The R package and Shiny application are available through GitHub at https://github.com/Zhangxf-ccnu/EnImpute. Supplementary information: Supplementary data are available at Bioinformatics online.
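
Combining the outputs of several imputation methods can be sketched as an entry-wise aggregation over equally shaped gene-by-cell matrices. The abstract does not state which combination rule is used, so the trimmed mean below is an assumption chosen for illustration, and the function names `trimmed_mean` and `ensemble_impute` are hypothetical rather than part of the EnImpute package.

```python
from statistics import mean


def trimmed_mean(values, trim=1):
    """Average `values` after dropping the `trim` smallest and largest entries.

    If there are too few values to trim, the plain mean is returned.
    """
    ordered = sorted(values)
    if len(ordered) > 2 * trim:
        ordered = ordered[trim:len(ordered) - trim]
    return mean(ordered)


def ensemble_impute(imputed_matrices, trim=1):
    """Combine equally shaped matrices (lists of rows) entry by entry."""
    n_rows = len(imputed_matrices[0])
    n_cols = len(imputed_matrices[0][0])
    return [
        [trimmed_mean([m[i][j] for m in imputed_matrices], trim) for j in range(n_cols)]
        for i in range(n_rows)
    ]
```

Trimming makes the consensus less sensitive to a single method that over- or under-imputes a given entry; with `trim=0` the function reduces to a plain average of the methods.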


Bayesian inference of ancestral dates on bacterial phylogenetic trees

10.1101/347385 ◽

2018 ◽

Author(s):

Xavier Didelot ◽

Nicholas J Croucher ◽

Stephen D Bentley ◽

Simon R Harris ◽

Daniel J Wilson

Keyword(s):

Phylogenetic Trees ◽

Single Species ◽

R Package ◽

Bacterial Genomes ◽

Phylogenetic Methods ◽

Bacterial Genomics ◽

Wide Range ◽

Genomic Studies ◽

Dated Phylogeny ◽

Phylogenetic Method

Abstract The sequencing and comparative analysis of a collection of bacterial genomes from a single species or lineage of interest can lead to key insights into its evolution, ecology or epidemiology. The tool of choice for such a study is often to build a phylogenetic tree, and more specifically, when possible, a dated phylogeny, in which the dates of all common ancestors are estimated. Here we propose a new Bayesian methodology to construct dated phylogenies which is specifically designed for bacterial genomics. Unlike previous Bayesian methods aimed at building dated phylogenies, we consider that the phylogenetic relationships between the genomes have been previously evaluated using a standard phylogenetic method, which makes our methodology much faster and more scalable. This two-step approach also allows us to directly exploit existing phylogenetic methods that detect bacterial recombination, and therefore to account for the effect of recombination in the construction of a dated phylogeny. We analysed many simulated datasets in order to benchmark the performance of our approach in a wide range of situations. Furthermore, we present applications to three different real datasets from recent bacterial genomic studies. Our methodology is implemented in an R package called BactDating, which is freely available for download at https://github.com/xavierdidelot/BactDating.


Genomic Metrics Applied to Rhizobiales (Hyphomicrobiales): Species Reclassification, Identification of Unauthentic Genomes and False Type Strains

Frontiers in Microbiology ◽

10.3389/fmicb.2021.614957 ◽

2021 ◽

Vol 12 ◽

Author(s):

Camila Gazolla Volpiano ◽

Fernando Hayashi Sant’Anna ◽

Adriana Ambrosini ◽

Jackson Freitas Brilhante de São José ◽

Anelise Beneduzi ◽

...

Keyword(s):

Bacterial Species ◽

R Package ◽

High Identity ◽

Mesorhizobium Loti ◽

Aurantimonas Coralicida ◽

Downstream Analysis ◽

Chelatobacter Heintzii ◽

Species Descriptions ◽

Type Strains ◽

Term Type

Taxonomic decisions within the order Rhizobiales have relied heavily on the interpretations of highly conserved 16S rRNA sequences and DNA–DNA hybridizations (DDH). Currently, bacterial species are defined as including strains that present 95–96% average nucleotide identity (ANI) and 70% digital DDH (dDDH). Thus, ANI values from 520 genome sequences of type strains from species of the order Rhizobiales were computed. From the resulting 270,400 comparisons, a ≥95% cut-off was used to extract high-identity genome clusters by enumerating maximal cliques. Coupling this graph-based approach with dDDH from clusters of interest, it was found that: (i) there is synonymy between Aminobacter lissarensis and Aminobacter carboxidus, Aurantimonas manganoxydans and Aurantimonas coralicida, “Bartonella mastomydis” and Bartonella elizabethae, Chelativorans oligotrophicus and Chelativorans multitrophicus, Rhizobium azibense and Rhizobium gallicum, Rhizobium fabae and Rhizobium pisi, and Rhodoplanes piscinae and Rhodoplanes serenus; (ii) Chelatobacter heintzii is not a synonym of Aminobacter aminovorans; (iii) “Bartonella vinsonii” subsp. arupensis and “B. vinsonii” subsp. berkhoffii represent members of different species; (iv) the genome accessions GCF_003024615.1 (“Mesorhizobium loti LMG 6125T”), GCF_003024595.1 (“Mesorhizobium plurifarium LMG 11892T”), GCF_003096615.1 (“Methylobacterium organophilum DSM 760T”), and GCF_000373025.1 (“R. gallicum R-602 spT”) are not from the genuine type strains used for the respective species descriptions; and (v) “Xanthobacter autotrophicus” Py2 and “Aminobacter aminovorans” KCTC 2477T represent cases of misuse of the term “type strain”. Aminobacter heintzii comb. nov. and the reclassification of Aminobacter ciceronei as A. heintzii are also proposed.
To facilitate the downstream analysis of large ANI matrices, we introduce here ProKlust (“Prokaryotic Clusters”), an R package that uses a graph-based approach to obtain, filter, and visualize clusters on identity/similarity matrices, with settable cut-off points and the possibility of entering multiple matrices.
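
The graph-based step described above (threshold an identity matrix, then enumerate maximal cliques) can be sketched as follows. This is not ProKlust's code: the function names are hypothetical, the basic Bron-Kerbosch recursion is one standard way to enumerate maximal cliques and is assumed here, and requiring the cut-off to hold in both directions is likewise an assumption made because pairwise ANI values need not be symmetric. Note that the 270,400 comparisons quoted in the abstract correspond to the full 520 × 520 matrix.

```python
def build_graph(identity, cutoff=95.0):
    """Link two genomes when their identity meets the cutoff in both directions.

    `identity` maps genome -> {genome -> percent identity}.
    """
    names = sorted(identity)
    graph = {name: set() for name in names}
    for a in names:
        for b in names:
            if a != b and identity[a][b] >= cutoff and identity[b][a] >= cutoff:
                graph[a].add(b)
    return graph


def maximal_cliques(graph):
    """Enumerate all maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(current, candidates, excluded):
        # A clique is maximal when no vertex is left to extend it.
        if not candidates and not excluded:
            cliques.append(sorted(current))
            return
        for node in sorted(candidates):
            expand(current | {node}, candidates & graph[node], excluded & graph[node])
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(set(), set(graph), set())
    return sorted(cliques)
```

Genomes with no neighbour above the cut-off come out as single-member cliques, so a caller interested in candidate synonyms would keep only cliques with two or more members. Because cliques may overlap, one genome can belong to several high-identity clusters.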
