A new sequence logo plot to highlight enrichment and depletion

Mapping Intimacies ◽

10.1101/226597 ◽

2017 ◽

Cited By ~ 1

Author(s):

Kushal K. Dey ◽

Dongyue Xie ◽

Matthew Stephens

Keyword(s):

Protein Sequences ◽

R Package ◽

Sequence Logo ◽

Sequence Motifs ◽

Sequence Alignments ◽

Cancer Mutation ◽

Single Character ◽

Graphical Tool ◽

Wide Range ◽

Visual Clutter

Abstract Background Sequence logo plots have become a standard graphical tool for visualizing sequence motifs in DNA, RNA or protein sequences. However, standard logo plots primarily highlight enrichment of symbols, and may fail to highlight interesting depletions. Current alternatives that try to highlight depletion often produce visually cluttered logos. Results We introduce a new sequence logo plot, the EDLogo plot, that highlights both enrichment and depletion, while minimizing visual clutter. We provide an easy-to-use and highly customizable R package Logolas to produce a range of logo plots, including EDLogo plots. This software also allows elements in the logo plot to be strings of characters, rather than a single character, extending the range of applications beyond the usual DNA, RNA or protein sequences. We illustrate our methods and software on applications to transcription factor binding site motifs, protein sequence alignments and cancer mutation signature profiles. Conclusion Our new EDLogo plots, and flexible software implementation, can help data analysts visualize both enrichment and depletion of characters (DNA sequence bases, amino acids, etc.) across a wide range of applications.
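To make the idea of scoring both enrichment and depletion concrete, here is a minimal Python sketch. It does not call Logolas, and the scoring rule is an assumption chosen for illustration: each symbol gets the log2 ratio of its column frequency to a background frequency, centred on the column median, so positive scores read as enrichment and negative scores as depletion. The function name, the uniform background and the pseudocount are all hypothetical.

```python
import math
from statistics import median


def enrichment_depletion_scores(sequences, alphabet="ACGT", background=None, pseudocount=0.5):
    """Return one dict per alignment column mapping symbol -> signed score.

    Assumed scoring (illustrative only): log2(frequency / background),
    centred on the median of the column so scores can be negative.
    """
    if background is None:
        # Assumption: uniform background over the alphabet.
        background = {s: 1.0 / len(alphabet) for s in alphabet}
    scores = []
    for pos in range(len(sequences[0])):
        # Pseudocounts keep unseen symbols from producing log(0).
        counts = {s: pseudocount for s in alphabet}
        for seq in sequences:
            counts[seq[pos]] += 1
        total = sum(counts.values())
        log_ratios = {s: math.log2((counts[s] / total) / background[s]) for s in alphabet}
        centre = median(log_ratios.values())
        scores.append({s: log_ratios[s] - centre for s in alphabet})
    return scores


# Toy aligned binding sites (made-up data).
sites = ["ACGT", "ACGA", "ACGC", "ATGG"]
scores = enrichment_depletion_scores(sites)
for pos, column in enumerate(scores, start=1):
    enriched = [s for s, v in column.items() if v > 0]
    depleted = [s for s, v in column.items() if v < 0]
    print(pos, "enriched:", enriched, "depleted:", depleted)
```

Centring on the median means a column with no clear signal gives scores near zero for every symbol, which is one way to keep uninformative positions from adding clutter.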

BloodGen3Module: Blood transcriptional module repertoire analysis and visualization using R

Bioinformatics ◽

10.1093/bioinformatics/btab121 ◽

2021 ◽

Author(s):

Darawan Rinchai ◽

Jessica Roelands ◽

Mohammed Toufiq ◽

Wouter Hendrickx ◽

Matthew C Altman ◽

...

Keyword(s):

Transcript Abundance ◽

R Package ◽

Supplementary Information ◽

Illustrative Case ◽

Bioinformatic Tools ◽

Transcriptional Module ◽

Wide Range ◽

Downstream Analysis ◽

Computing Module ◽

Parallel Workflow

Abstract Motivation We previously described the construction and characterization of generic and reusable blood transcriptional module repertoires. More recently we released a third iteration (“BloodGen3” module repertoire) that comprises 382 functionally annotated gene sets (modules) and encompasses 14,168 transcripts. Custom bioinformatic tools are needed to support downstream analysis, visualization and interpretation relying on such fixed module repertoires. Results We have developed and describe here an R package, BloodGen3Module. The functions of our package permit group comparison analyses to be performed at the module level, and to display the results as annotated fingerprint grid plots. A parallel workflow for computing module repertoire changes for individual samples rather than groups of samples is also available; these results are displayed as fingerprint heatmaps. An illustrative case is used to demonstrate the steps involved in generating blood transcriptome repertoire fingerprints of septic patients. Taken together, this resource could facilitate the analysis and interpretation of changes in blood transcript abundance observed across a wide range of pathological and physiological states. Availability The BloodGen3Module package and documentation are freely available from GitHub: https://github.com/Drinchai/BloodGen3Module Supplementary information Supplementary data are available at Bioinformatics online.
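A module-level group comparison can be pictured as summarising, for each fixed gene set, how many of its transcripts move up or down between two groups. The Python sketch below is an assumption-laden illustration and not the BloodGen3Module API: the function name, the module labels and the simple mean-difference cutoff (standing in for a proper statistical test) are all hypothetical.

```python
from statistics import mean


def module_response(expression, modules, cases, controls, min_diff=1.0):
    """Percentage of transcripts per module that are higher or lower in cases.

    expression: {transcript: {sample: log2 abundance}}
    modules:    {module name: [transcripts]}
    min_diff:   assumed cutoff on the difference of group means (log2 scale).
    """
    result = {}
    for module, transcripts in modules.items():
        up = down = 0
        for t in transcripts:
            diff = (mean(expression[t][s] for s in cases)
                    - mean(expression[t][s] for s in controls))
            if diff >= min_diff:
                up += 1
            elif diff <= -min_diff:
                down += 1
        n = len(transcripts)
        result[module] = {"pct_up": 100.0 * up / n, "pct_down": 100.0 * down / n}
    return result


# Made-up data: two cases (c1, c2) and two controls (h1, h2).
expression = {
    "t1": {"c1": 8.0, "c2": 9.0, "h1": 6.0, "h2": 6.0},
    "t2": {"c1": 7.0, "c2": 7.0, "h1": 5.0, "h2": 6.0},
    "t3": {"c1": 5.0, "c2": 5.0, "h1": 5.0, "h2": 5.0},
    "t4": {"c1": 4.0, "c2": 4.0, "h1": 6.0, "h2": 6.0},
    "t5": {"c1": 3.0, "c2": 3.0, "h1": 5.0, "h2": 5.0},
    "t6": {"c1": 6.0, "c2": 6.0, "h1": 6.0, "h2": 6.5},
}
modules = {"M_a": ["t1", "t2", "t3", "t4"], "M_b": ["t5", "t6"]}
fingerprint = module_response(expression, modules, ["c1", "c2"], ["h1", "h2"])
print(fingerprint)
```

One row of percentages per module is the kind of summary that could then be laid out on a grid or heatmap.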

Whole genome-based characterisation of antimicrobial resistance and genetic diversity in Campylobacter jejuni and Campylobacter coli from ruminants

Scientific Reports ◽

10.1038/s41598-021-88318-0 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Medelin Ocejo ◽

Beatriz Oporto ◽

José Luis Lavín ◽

Ana Hurtado

Keyword(s):

Genetic Diversity ◽

Antimicrobial Resistance ◽

Campylobacter Jejuni ◽

Resistance Genes ◽

Campylobacter Coli ◽

Whole Genome ◽

Northern Spain ◽

Sequence Alignments ◽

Phenotypic Data ◽

Wide Range

Abstract Campylobacter, a leading cause of gastroenteritis in humans, asymptomatically colonises the intestinal tract of a wide range of animals. Although antimicrobial treatment is restricted to severe cases, the increase of antimicrobial resistance (AMR) is a concern. Considering the significant contribution of ruminants as reservoirs of resistant Campylobacter, Illumina whole-genome sequencing was used to characterise the mechanisms of AMR in Campylobacter jejuni and Campylobacter coli recovered from beef cattle, dairy cattle, and sheep in northern Spain. Genome analysis showed extensive genetic diversity that clearly separated both species. Resistance genotypes were identified by screening assembled sequences with BLASTn and ABRicate, and additional sequence alignments were performed to search for frameshift mutations and gene modifications. A high correlation was observed between phenotypic resistance to a given antimicrobial and the presence of the corresponding known resistance genes. Detailed sequence analysis allowed us to detect the recently described mosaic tet(O/M/O) gene in one C. coli, describe possible new alleles of blaOXA-61-like genes, and decipher the genetic context of aminoglycoside resistance genes, as well as the plasmid/chromosomal location of the different AMR genes and their implication for resistance spread. Updated resistance gene databases and detailed analysis of the matched open reading frames are needed to avoid errors when using WGS-based analysis pipelines for AMR detection in the absence of phenotypic data.

MUREN: a robust and multi-reference approach of RNA-seq transcript normalization

BMC Bioinformatics ◽

10.1186/s12859-021-04288-0 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Yance Feng ◽

Lei M. Li

Keyword(s):

Biological Significance ◽

Housekeeping Genes ◽

R Package ◽

Data Sets ◽

Statistical Regression ◽

Rna Seq ◽

Least Trimmed Squares ◽

Standard Data ◽

Wide Range ◽

Multiple References

Abstract Background Normalization of RNA-seq data aims at identifying biological expression differentiation between samples by removing the effects of unwanted confounding factors. Explicitly or implicitly, the justification of normalization requires a set of housekeeping genes. However, the existence of housekeeping genes common for a very large collection of samples, especially under a wide range of conditions, is questionable. Results We propose to carry out pairwise normalization with respect to multiple references, selected from representative samples. Then the pairwise intermediates are integrated based on a linear model that adjusts the reference effects. Motivated by the notion of housekeeping genes and their statistical counterparts, we adopt the robust least trimmed squares regression in pairwise normalization. The proposed method (MUREN) is compared with other existing tools on some standard data sets. The assessment of normalization quality emphasizes preserving possible asymmetric differentiation, whose biological significance is exemplified by single-cell data on the cell cycle. MUREN is implemented as an R package. The code, under the GPL-3 license, is available on GitHub: github.com/hippo-yf/MUREN and on conda: anaconda.org/hippo-yf/r-muren. Conclusions MUREN performs the RNA-seq normalization using a two-step statistical regression induced from a general principle. We propose that the densities of pairwise differentiations be used to evaluate the goodness of normalization. MUREN adjusts the mode of differentiation toward zero while preserving the skewness due to biological asymmetric differentiation. Moreover, by robustly integrating pre-normalized counts with respect to multiple references, MUREN is immune to individual outlier samples.
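The role of least trimmed squares (LTS) in pairwise normalization can be illustrated with its simplest special case: estimating a single shift between a sample and a reference on the log scale. The Python sketch below is not the MUREN implementation; it assumes a location-only model, where the LTS solution is the mean of the contiguous window of sorted log-ratios with the smallest sum of squared deviations. The function name and the toy values are hypothetical.

```python
def lts_location(values, h=None):
    """Least trimmed squares estimate of a common shift.

    Keeps the h values that agree best with each other and averages them.
    For a location-only model the optimal subset is contiguous in sorted
    order, so scanning sorted windows gives the exact LTS solution.
    """
    xs = sorted(values)
    n = len(xs)
    if h is None:
        h = n // 2 + 1  # assumed default: just over half of the genes
    best_sse, best_mean = None, None
    for start in range(n - h + 1):
        window = xs[start:start + h]
        m = sum(window) / h
        sse = sum((x - m) ** 2 for x in window)
        if best_sse is None or sse < best_sse:
            best_sse, best_mean = sse, m
    return best_mean


# Made-up log2 ratios (sample vs. reference): most genes share a shift near 1,
# while three genes are strongly and asymmetrically up-regulated.
log_ratios = [1.0, 1.03, 0.98, 1.0, 1.01, 0.99, 4.0, 4.2, 3.9]
shift = lts_location(log_ratios)
normalized = [r - shift for r in log_ratios]
print(round(shift, 3), [round(v, 2) for v in normalized])
```

Because the up-regulated genes are trimmed rather than averaged in, the bulk of the genes ends up centred near zero while the asymmetric tail is left intact, which is the behaviour the abstract describes as preserving skewness.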

An open-source R-package and web application for high-quality probabilistic predictions in hydrology

10.5194/egusphere-egu21-8549 ◽

2021 ◽

Author(s):

Jason Hunter ◽

Mark Thyer ◽

Dmitri Kavetski ◽

David McInerney

Keyword(s):

Open Source ◽

Web Application ◽

R Package ◽

Error Model ◽

Objective Functions ◽

High Quality ◽

Wide Range ◽

Probabilistic Error

Probabilistic predictions provide crucial information regarding the uncertainty of hydrological predictions, which are a key input for risk-based decision-making. However, they are often excluded from hydrological modelling applications because suitable probabilistic error models can be challenging both to construct and to interpret, and the quality of results is often reliant on the objective function used to calibrate the hydrological model. We present an open-source R-package and an online web application that achieve the following two aims. Firstly, these resources are easy-to-use and accessible, so that users need not have specialised knowledge in probabilistic modelling to apply them. Secondly, the probabilistic error model that we describe provides high-quality probabilistic predictions for a wide range of commonly-used hydrological objective functions, which it is only able to do by including an innovation that resolves a long-standing issue relating to model assumptions that previously prevented this broad application. We demonstrate our methods by comparing our new probabilistic error model with an existing reference error model in an empirical case study that uses 54 perennial Australian catchments, the hydrological model GR4J, 8 common objective functions and 4 performance metrics (reliability, precision, volumetric bias and errors in the flow duration curve). The existing reference error model introduces additional flow dependencies into the residual error structure when it is used with most of the study objective functions, which in turn leads to poor-quality probabilistic predictions. In contrast, the new probabilistic error model achieves high-quality probabilistic predictions for all objective functions used in this case study. The new probabilistic error model and the open-source software and web application aim to facilitate the adoption of probabilistic predictions in the hydrological modelling community, and to improve the quality of predictions and decisions that are made using those predictions. In particular, our methods can be used to achieve high-quality probabilistic predictions from hydrological models that are calibrated with a wide range of common objective functions.
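For readers unfamiliar with residual error models, the Python sketch below shows one generic way a deterministic streamflow simulation can be turned into prediction limits. It is not the error model described in the abstract: it assumes Gaussian residuals in Box-Cox transformed flow space with a fixed, arbitrarily chosen power parameter, and every function name and data value is hypothetical.

```python
import random
from statistics import mean, pstdev


def boxcox(q, lam):
    """Box-Cox transform of a positive flow value."""
    return (q ** lam - 1.0) / lam


def inv_boxcox(z, lam):
    """Inverse Box-Cox transform, clipped so that flows stay non-negative."""
    return max(lam * z + 1.0, 0.0) ** (1.0 / lam)


def fit_residual_model(observed, simulated, lam=0.2):
    """Mean and standard deviation of residuals in transformed space."""
    residuals = [boxcox(o, lam) - boxcox(s, lam) for o, s in zip(observed, simulated)]
    return mean(residuals), pstdev(residuals)


def prediction_limits(simulated, mu, sigma, lam=0.2, n_rep=2000, level=0.90, seed=1):
    """Sample replicates around each simulated flow and return (lower, upper) limits."""
    rng = random.Random(seed)
    tail = (1.0 - level) / 2.0
    limits = []
    for s in simulated:
        z = boxcox(s, lam)
        reps = sorted(inv_boxcox(z + rng.gauss(mu, sigma), lam) for _ in range(n_rep))
        limits.append((reps[int(tail * n_rep)], reps[int((1.0 - tail) * n_rep) - 1]))
    return limits


# Made-up observed and simulated flows for five time steps.
observed = [1.2, 3.5, 10.0, 0.8, 2.2]
simulated = [1.0, 4.0, 8.5, 1.0, 2.0]
mu, sigma = fit_residual_model(observed, simulated)
limits = prediction_limits(simulated, mu, sigma)
for s, (lower, upper) in zip(simulated, limits):
    print(f"simulated={s:5.1f}  limits=({lower:.2f}, {upper:.2f})")
```

With a power parameter below 1, the same residual spread in transformed space maps to wider limits at high flows than at low flows, which is the usual motivation for working in transformed space.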

Origins and Evolution of the Global RNA Virome

mBio ◽

10.1128/mbio.02329-18 ◽

2018 ◽

Vol 9 (6) ◽

Cited By ~ 121

Author(s):

Yuri I. Wolf ◽

Darius Kazlauskas ◽

Jaime Iranzo ◽

Adriana Lucía-Sanz ◽

Jens H. Kuhn ◽

...

Keyword(s):

Plant Pathogens ◽

Rna Viruses ◽

Rna Virus ◽

Phylogenomic Analysis ◽

Rna Helicases ◽

Sequence Alignments ◽

Recent Advances ◽

Wide Range ◽

Positive Sense ◽

Dsrna Viruses

ABSTRACT Viruses with RNA genomes dominate the eukaryotic virome, reaching enormous diversity in animals and plants. The recent advances of metaviromics prompted us to perform a detailed phylogenomic reconstruction of the evolution of the dramatically expanded global RNA virome. The only universal gene among RNA viruses is the gene encoding the RNA-dependent RNA polymerase (RdRp). We developed an iterative computational procedure that alternates the RdRp phylogenetic tree construction with refinement of the underlying multiple-sequence alignments. The resulting tree encompasses 4,617 RNA virus RdRps and consists of 5 major branches; 2 of the branches include positive-sense RNA viruses, 1 is a mix of positive-sense (+) RNA and double-stranded RNA (dsRNA) viruses, and 2 consist of dsRNA and negative-sense (−) RNA viruses, respectively. This tree topology implies that dsRNA viruses evolved from +RNA viruses on at least two independent occasions, whereas −RNA viruses evolved from dsRNA viruses. Reconstruction of RNA virus evolution using the RdRp tree as the scaffold suggests that the last common ancestors of the major branches of +RNA viruses encoded only the RdRp and a single jelly-roll capsid protein. Subsequent evolution involved independent capture of additional genes, in particular, those encoding distinct RNA helicases, enabling replication of larger RNA genomes and facilitating virus genome expression and virus-host interactions. Phylogenomic analysis reveals extensive gene module exchange among diverse viruses and horizontal virus transfer between distantly related hosts. Although the network of evolutionary relationships within the RNA virome is bound to further expand, the present results call for a thorough reevaluation of the RNA virus taxonomy. IMPORTANCE The majority of the diverse viruses infecting eukaryotes have RNA genomes, including numerous human, animal, and plant pathogens. Recent advances of metagenomics have led to the discovery of many new groups of RNA viruses in a wide range of hosts. These findings enable a far more complete reconstruction of the evolution of RNA viruses than was attainable previously. This reconstruction reveals the relationships between different Baltimore classes of viruses and indicates extensive transfer of viruses between distantly related hosts, such as plants and animals. These results call for a major revision of the existing taxonomy of RNA viruses.

Metallocluster transactions: dynamic protein interactions guide the biosynthesis of Fe–S clusters in bacteria

Biochemical Society Transactions ◽

10.1042/bst20180365 ◽

2018 ◽

Vol 46 (6) ◽

pp. 1593-1603 ◽

Cited By ~ 9

Author(s):

Chenkang Zheng ◽

Patricia C. Dos Santos

Keyword(s):

Protein Interactions ◽

Protein Complexes ◽

Specific Protein ◽

Biological Synthesis ◽

Cluster Assembly ◽

Sequence Motifs ◽

Protein Protein Interactions ◽

Cysteine Desulfurase ◽

Wide Range ◽

Domains Of Life

Iron–sulfur (Fe–S) clusters are ubiquitous cofactors present in all domains of life. The chemistries catalyzed by these inorganic cofactors are diverse and their associated enzymes are involved in many cellular processes. Despite the wide range of structures reported for Fe–S clusters inserted into proteins, the biological synthesis of all Fe–S clusters starts with the assembly of simple units of 2Fe–2S and 4Fe–4S clusters. Several systems have been associated with the formation of Fe–S clusters in bacteria with varying phylogenetic origins and number of biosynthetic and regulatory components. All systems, however, construct Fe–S clusters through a similar biosynthetic scheme involving three main steps: (1) sulfur activation by a cysteine desulfurase, (2) cluster assembly by a scaffold protein, and (3) guided delivery of Fe–S units to either final acceptors or biosynthetic enzymes involved in the formation of complex metalloclusters. Another unifying feature of the biological formation of Fe–S clusters in bacteria is that these systems are tightly regulated by a network of protein interactions. Thus, the formation of transient protein complexes among biosynthetic components allows for the direct transfer of reactive sulfur and Fe–S intermediates, preventing oxygen damage and reactions with non-physiological targets. Recent studies revealed the importance of reciprocal signature sequence motifs that enable specific protein–protein interactions and consequently guide the transactions between physiological donors and acceptors. Such findings provide insights into strategies used by bacteria to regulate the flow of reactive intermediates and provide protein barcodes to uncover yet-unidentified cellular components involved in Fe–S metabolism.

Benchmarking Statistical Multiple Sequence Alignment

10.1101/304659 ◽

2018 ◽

Cited By ~ 1

Author(s):

Michael Nute ◽

Ehsan Saleh ◽

Tandy Warnow

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Structural Alignment ◽

Estimation Method ◽

Simulated Data ◽

Protein Sequences ◽

Data Sets ◽

Sequence Alignments ◽

Multiple Sequence ◽

Simulated Data Sets

Abstract The estimation of multiple sequence alignments of protein sequences is a basic step in many bioinformatics pipelines, including protein structure prediction, protein family identification, and phylogeny estimation. Statistical co-estimation of alignments and trees under stochastic models of sequence evolution has long been considered the most rigorous technique for estimating alignments and trees, but little is known about the accuracy of such methods on biological benchmarks. We report the results of an extensive study evaluating the most popular protein alignment methods as well as the statistical co-estimation method BAli-Phy on 1192 protein data sets from established benchmarks as well as on 120 simulated data sets. Our study (which used more than 230 CPU years for the BAli-Phy analyses alone) shows that BAli-Phy is dramatically more accurate than the other alignment methods on the simulated data sets, but is among the least accurate on the biological benchmarks. There are several potential causes for this discordance, including model misspecification, errors in the reference alignments, and conflicts between structural alignment and evolutionary alignments; future research is needed to understand the most likely explanation for our observations. Keywords: multiple sequence alignment, BAli-Phy, protein sequences, structural alignment, homology
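The abstract does not say which accuracy criteria were used. As an illustration of how an estimated alignment can be scored against a reference, the Python sketch below computes a sum-of-pairs style score: the fraction of residue pairs aligned in the reference that are also aligned in the estimate. The function names and toy alignments are hypothetical, and the choice of this particular measure is an assumption.

```python
from itertools import combinations


def homologous_pairs(alignment, gap="-"):
    """Set of residue pairs placed in the same column.

    Each residue is identified by (row index, position in the ungapped sequence).
    """
    pairs = set()
    counters = [0] * len(alignment)
    for col in range(len(alignment[0])):
        present = []
        for row, seq in enumerate(alignment):
            if seq[col] != gap:
                present.append((row, counters[row]))
                counters[row] += 1
        # Rows are visited in order, so each pair has a consistent orientation.
        pairs.update(combinations(present, 2))
    return pairs


def sp_score(estimated, reference):
    """Fraction of reference residue pairs recovered by the estimated alignment."""
    ref = homologous_pairs(reference)
    est = homologous_pairs(estimated)
    return len(ref & est) / len(ref)


# Toy example: the estimate misplaces one gap relative to the reference.
reference = ["AC-GT", "ACTGT"]
estimated = ["ACG-T", "ACTGT"]
print(sp_score(estimated, reference))
```

A score of 1.0 means every homology asserted by the reference is reproduced; errors in the reference itself, one of the possible causes of discordance listed in the abstract, would lower the score of a correct estimate.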

A test statistic to quantify treelikeness in phylogenetics

10.1101/2021.02.16.431544 ◽

2021 ◽

Author(s):

Caitlin Cherryh ◽

Bui Quang Minh ◽

Rob Lanfear

Keyword(s):

Evolutionary History ◽

Incomplete Lineage Sorting ◽

Phylogenetic Analyses ◽

Phylogenetic Network ◽

Parametric Bootstrap ◽

Lineage Sorting ◽

Test Statistic ◽

Sequence Alignments ◽

Wide Range ◽

History Of

Abstract Most phylogenetic analyses assume that the evolutionary history of an alignment (either that of a single locus, or of multiple concatenated loci) can be described by a single bifurcating tree, the so-called treelikeness assumption. Treelikeness can be violated by biological events such as recombination, introgression, or incomplete lineage sorting, and by systematic errors in phylogenetic analyses. The incorrect assumption of treelikeness may then mislead phylogenetic inferences. To quantify and test for treelikeness in alignments, we develop a test statistic which we call the tree proportion. This statistic quantifies the proportion of the edge weights in a phylogenetic network that are represented in a bifurcating phylogenetic tree of the same alignment. We extend this statistic to a statistical test of treelikeness using a parametric bootstrap. We use extensive simulations to compare tree proportion to a range of related approaches. We show that tree proportion successfully identifies non-treelikeness in a wide range of simulation scenarios, and discuss its strengths and weaknesses compared to other approaches. The power of the tree-proportion test to reject non-treelike alignments can be lower than some other approaches, but these approaches tend to be limited in their scope and/or the ease with which they can be interpreted. Our recommendation is to test treelikeness of sequence alignments with both tree proportion and mosaic methods such as 3Seq. The scripts necessary to replicate this study are available at https://github.com/caitlinch/treelikeness
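The definition of tree proportion given in the abstract can be written down directly if the network is represented as a list of weighted splits. The Python sketch below is an illustration under stated assumptions, not the authors' scripts: the tree is supplied by the caller rather than inferred, splits are given as one side of a bipartition, and all names and weights are hypothetical.

```python
def canonical(side, taxa):
    """Represent a split by a fixed choice of one of its two sides."""
    side = frozenset(side)
    other = frozenset(taxa) - side
    return min(side, other, key=lambda s: (len(s), sorted(s)))


def tree_proportion(network_splits, tree_splits, taxa):
    """Share of the network's total split weight that lies on splits of the tree.

    network_splits: list of (one side of the split, weight)
    tree_splits:    list of sides defining the splits of a bifurcating tree
    """
    tree = {canonical(s, taxa) for s in tree_splits}
    total = sum(weight for _, weight in network_splits)
    in_tree = sum(weight for side, weight in network_splits
                  if canonical(side, taxa) in tree)
    return in_tree / total


taxa = ["A", "B", "C", "D"]
# Made-up network: four trivial splits, one strong split AB|CD, and one
# conflicting split AC|BD that no tree containing AB|CD can also display.
network = [({"A"}, 1.0), ({"B"}, 1.0), ({"C"}, 1.0), ({"D"}, 1.0),
           ({"A", "B"}, 3.0), ({"A", "C"}, 1.0)]
tree = [{"A"}, {"B"}, {"C"}, {"D"}, {"C", "D"}]  # CD|AB is the same split as AB|CD
print(tree_proportion(network, tree, taxa))
```

A value of 1.0 would mean the network carries no conflicting signal beyond the tree; the parametric bootstrap mentioned in the abstract would then ask whether an observed value is lower than expected for data simulated on a tree.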

Novel pedagogical tool for simultaneous learning of plane geometry and R programming

Research Ideas and Outcomes ◽

10.3897/rio.4.e25485 ◽

2018 ◽

Vol 4 ◽

pp. e25485 ◽

Cited By ~ 1

Author(s):

Álvaro Briz-Redón ◽

Ángel Serrano-Aroca

Keyword(s):

Programming Language ◽

Undergraduate Students ◽

R Package ◽

Plane Geometry ◽

Cost Ratio ◽

Geometric Constructions ◽

Pedagogical Tool ◽

Benefit Cost Ratio ◽

Wide Range ◽

R Programming

Programming a computer is an activity that can be very beneficial to undergraduate students in terms of improving their mental capabilities, collaborative attitudes and levels of engagement in learning. Despite the initial difficulties that typically arise when learning to program, there are several well-known strategies to overcome them, providing a very high benefit-cost ratio to most of the students. Moreover, the use of a programming language usually raises students' interest in learning a specific concept, which has led many teachers around the world to employ a programming language as a learning environment for almost every possible topic. In particular, mathematics can be taught and learnt while using a suitable programming language. The R programming language is endowed with a wide range of capabilities that allow its use to learn different kinds of concepts while programming. Therefore, complex subjects such as mathematics could be learnt with the help of this powerful programming language. In addition, since the R language provides numerous graphical functions, it could be very useful for acquiring basic plane geometry and programming knowledge simultaneously at the undergraduate level. This paper describes the LearnGeom R package, a novel pedagogical tool, which contains multiple functions to learn geometry in R at different levels of difficulty, from the most basic geometric objects to high-complexity geometric constructions, while developing numerous programming skills.

Pedometric tools for classification of southwestern Amazonian soils: A quali-quantitative interpretation incorporating visible-near infrared spectroscopy

Journal of Near Infrared Spectroscopy ◽

10.1177/09670335211061854 ◽

2022 ◽

pp. 096703352110618

Author(s):

Orlando CH Tavares ◽

Tiago R Tavares ◽

Carlos R Pinheiro Junior ◽

Luciélio M da Silva ◽

Paulo GS Wadt ◽

...

Keyword(s):

Near Infrared ◽

Environmental Variability ◽

Diffuse Reflectance Spectroscopy ◽

Soil Classification ◽

R Package ◽

Chemical Components ◽

Soil Profiles ◽

Wide Range

The southwestern region of the Amazon has great environmental variability and a great complexity of pedoenvironments, owing to its rich variety of geological and geomorphological environments and to its being a transition region with two other Brazilian biomes. In this study, the use of pedometric tools (the Algorithms for Quantitative Pedology (AQP) R package and diffuse reflectance spectroscopy) was evaluated for the characterization of 15 soil profiles in the southwestern Amazon. The AQP statistical package, which evaluates the soil in depth based on slicing functions, indicated a wide range of variation in soil attributes, especially in the superficial horizons. In addition, the results of the similarity analysis corroborated the in-depth description of physical and chemical components and oxide contents, aiding the classification of soil profiles. The in-depth characterization of visible-near infrared spectra allowed inference of the pedogenetic processes of some profiles, setting precedents for future work aiming to establish analytical strategies for soil classification in the southwestern Amazon based on spectral data.
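The idea behind depth slicing is to resample horizon data, which come with irregular top and bottom depths, onto a regular depth grid so that profiles can be compared slice by slice. The Python sketch below illustrates that idea only; it does not use the AQP package, and the function names, the 1 cm slice thickness and the clay values are hypothetical.

```python
from statistics import mean


def slice_profile(horizons, max_depth):
    """Resample horizons onto 1 cm slices.

    horizons: list of (top_cm, bottom_cm, value), depths as integers
    Returns a list with one value per cm; None where no horizon covers the slice.
    """
    slices = [None] * max_depth
    for top, bottom, value in horizons:
        for depth in range(top, min(bottom, max_depth)):
            slices[depth] = value
    return slices


def mean_by_depth(profiles, max_depth):
    """Average an attribute across profiles at each 1 cm slice, skipping gaps."""
    sliced = [slice_profile(p, max_depth) for p in profiles]
    result = []
    for depth in range(max_depth):
        values = [s[depth] for s in sliced if s[depth] is not None]
        result.append(mean(values) if values else None)
    return result


# Made-up clay contents (%) for two profiles with different horizon boundaries.
profile_1 = [(0, 10, 20.0), (10, 30, 35.0)]
profile_2 = [(0, 15, 30.0), (15, 25, 40.0)]
depth_means = mean_by_depth([profile_1, profile_2], max_depth=30)
print(depth_means[0], depth_means[12], depth_means[20], depth_means[27])
```

Once every profile sits on the same grid, slice-wise summaries or between-profile dissimilarities can be computed without having to reconcile mismatched horizon boundaries.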