Family-Specific Gains and Losses of Protein Domains in the Legume and Grass Plant Families

Protein domains can be regarded as sections of protein sequences capable of folding independently and performing specific functions. In addition to amino-acid level changes, protein sequences can also evolve through domain shuffling events such as domain insertion, deletion, or duplication. The evolution of protein domains can be studied by tracking domain changes in a selected set of species with known phylogenetic relationships. Here, we conduct such an analysis by defining domains as “features” or “descriptors,” and considering the species (target + outgroup) as instances or data-points in a data matrix. We then look for features (domains) that are significantly different between the target species and the outgroup species. We study the domain changes in 2 large, distinct groups of plant species: legumes (Fabaceae) and grasses (Poaceae), with respect to selected outgroup species. We evaluate 4 types of domain feature matrices: domain content, domain duplication, domain abundance, and domain versatility. These matrices attempt to capture different ways in which protein sequences may evolve through domain changes: gain or loss of domains, increase or decrease in the copy number of domains along the sequences, expansion or contraction of domains, or changes in the number of adjacent domain partners. All the feature matrices were analyzed using feature selection techniques and statistical tests to select protein domains that have significantly different feature values in legumes and grasses. We report the biological functions of the top selected domains from the analysis of all the feature matrices. In addition, we perform domain-centric gene ontology (dcGO) enrichment analysis on all selected domains from all 4 feature matrices to study the gene ontology terms associated with the significantly evolving domains in legumes and grasses.
Domain content analysis revealed a striking loss of protein domains from the Fanconi anemia (FA) pathway, the pathway responsible for the repair of interstrand DNA crosslinks. The abundance analysis of domains found in legumes revealed an increase in glutathione synthase enzyme, an antioxidant required for nitrogen fixation, and a decrease in xanthine oxidizing enzymes, a phenomenon confirmed by previous studies. In grasses, the abundance analysis showed increases in domains related to gene silencing, which could be due to polyploidy or to an enhanced response to viral infection. We provide a docker container that can be used to perform this analysis workflow on any user-defined sets of species, available at https://cloud.docker.com/u/akshayayadav/repository/docker/akshayayadav/protein-domain-evolution-project .
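The four feature types described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy proteomes, the invented domain IDs, the exact feature definitions, and the use of Fisher's exact test for the content matrix (the abstract mentions "statistical tests" without naming them).

```python
from collections import defaultdict
from math import comb

# Hypothetical toy input: species -> proteins, each protein an ordered list of
# domain identifiers. The IDs are invented, not real Pfam accessions.
proteomes = {
    "legume_A": [["D1", "D2"], ["D2", "D2", "D3"]],
    "legume_B": [["D1", "D2"], ["D3"]],
    "outgroup_A": [["D2", "D4"], ["D4"]],
    "outgroup_B": [["D2"], ["D4", "D2"]],
}
targets = {"legume_A", "legume_B"}


def domain_features(proteins):
    """Per-domain feature values for one species (definitions are assumed)."""
    abundance = defaultdict(int)    # number of proteins containing the domain
    duplication = defaultdict(int)  # largest copy number within one protein
    partners = defaultdict(set)     # distinct adjacent domains (versatility)
    for doms in proteins:
        for d in set(doms):
            abundance[d] += 1
            duplication[d] = max(duplication[d], doms.count(d))
        for left, right in zip(doms, doms[1:]):
            if left != right:
                partners[left].add(right)
                partners[right].add(left)
    return {
        d: {
            "content": 1,
            "abundance": abundance[d],
            "duplication": duplication[d],
            "versatility": len(partners[d]),
        }
        for d in abundance
    }


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


features = {sp: domain_features(prots) for sp, prots in proteomes.items()}
all_domains = sorted({dom for f in features.values() for dom in f})

# Domain content: presence/absence in target species versus outgroup species.
content_pvalues = {}
for dom in all_domains:
    a = sum(dom in features[sp] for sp in targets)
    c = sum(dom in features[sp] for sp in proteomes if sp not in targets)
    b = len(targets) - a
    d = len(proteomes) - len(targets) - c
    content_pvalues[dom] = fisher_exact_two_sided(a, b, c, d)
```

With only two species per group, no p-value can fall below 1/3, so the toy data shows the mechanics rather than a meaningful result; a real analysis would need many more species and a multiple-testing correction.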


Computational modelling of Chromosomally Clustering Protein Domains In Bacteria

10.21203/rs.3.rs-646589/v1 ◽

2021 ◽

Author(s):

Chiara E. Cotroneo ◽

Isobel Claire Gormley ◽

Denis C. Shields ◽

Michael Salter-Townshend

Keyword(s):

Gene Ontology ◽

Mixture Model ◽

Computational Modelling ◽

Protein Domains ◽

Gene Clusters ◽

Protein Domain ◽

Parameter Estimates ◽

Bacterial Gene ◽

Wide Range ◽

The Stability

Abstract Background: In bacteria, genes with related functions - such as those involved in the metabolism of the same compound or in infection processes - are often physically close on the genome and form groups called clusters. The enrichment of such clusters over various distantly related bacteria can be used to predict the roles of genes of unknown function that cluster with characterised genes. There is no obvious rule to define a cluster, given their variability in size and intergenic distances, and the definition of what comprises a “gene”, since genes can gain and lose domains over time. Protein domains can cluster within a gene, or in adjacent genes of related function, and in both cases these are chromosomally clustered. Here, we model the distances between pairs of protein domain coding regions across a wide range of bacteria and archaea via a probabilistic two component mixture model, without imposing arbitrary thresholds in terms of gene numbers or distances. Results: We trained our model using matched Gene Ontology terms to label functionally related pairs and assess the stability of the parameters of the model across 14,178 archaeal and bacterial strains. We found that the parameters of our mixture model are remarkably stable across bacteria and archaea, except for endosymbionts and obligate intracellular pathogens. Obligate pathogens have smaller genomes, and although they vary, on average do not show noticeably different clustering distances; the main difference in the parameter estimates is that a far greater proportion of the genes sharing ontology terms are clustered. This may reflect that these genomes are enriched for complexes encoded by clustered core housekeeping genes, as a proportion of the total genes. Given the overall stability of the parameter estimates, we then used the mean parameter estimates across the entire dataset to investigate which gene ontology terms are most frequently associated with clustered genes.
Conclusions: Given the stability of the mixture model across species, it may be used to predict bacterial gene clusters that are shared across multiple species, in addition to giving insights into the evolutionary pressures on the chromosomal locations of genes in different species.
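As a rough illustration of what a two component mixture over pairwise distances looks like, the sketch below fits a mixture of two exponential distributions by expectation-maximisation. The exponential components, the initialisation, and the simulated distances are all assumptions for illustration; the abstract does not state which component distributions or fitting procedure the authors used.

```python
import math
import random


def fit_two_exponential_mixture(distances, n_iter=200):
    """EM for pi * Exp(mean_near) + (1 - pi) * Exp(mean_far)."""
    xs = sorted(distances)
    half = len(xs) // 2
    # Initialise the two means from the lower and upper halves of the data.
    mean_near = sum(xs[:half]) / half
    mean_far = sum(xs[half:]) / (len(xs) - half)
    pi = 0.5
    for _ in range(n_iter):
        # E-step: posterior probability that each distance is "clustered".
        resp = []
        for x in distances:
            near = pi * math.exp(-x / mean_near) / mean_near
            far = (1 - pi) * math.exp(-x / mean_far) / mean_far
            resp.append(near / (near + far))
        # M-step: re-estimate the mixing weight and the component means.
        total = sum(resp)
        pi = total / len(distances)
        mean_near = sum(r * x for r, x in zip(resp, distances)) / total
        mean_far = sum((1 - r) * x for r, x in zip(resp, distances)) / (
            len(distances) - total
        )
    return pi, mean_near, mean_far


def prob_clustered(x, pi, mean_near, mean_far):
    """Posterior probability that a pair at distance x is in the near component."""
    near = pi * math.exp(-x / mean_near) / mean_near
    far = (1 - pi) * math.exp(-x / mean_far) / mean_far
    return near / (near + far)


# Simulated distances (in base pairs) standing in for real domain-pair data.
rng = random.Random(0)
toy_distances = [rng.expovariate(1 / 500) for _ in range(300)] + [
    rng.expovariate(1 / 20000) for _ in range(300)
]
pi, mean_near, mean_far = fit_two_exponential_mixture(toy_distances)
```

Because each pair receives a posterior probability instead of being cut at a fixed distance, this kind of model avoids the arbitrary thresholds the abstract refers to.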


Computational modelling of chromosomally clustering protein domains in bacteria

BMC Bioinformatics ◽

10.1186/s12859-021-04512-x ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Chiara E. Cotroneo ◽

Isobel Claire Gormley ◽

Denis C. Shields ◽

Michael Salter-Townshend

Keyword(s):

Gene Ontology ◽

Mixture Model ◽

Computational Modelling ◽

Protein Domains ◽

Gene Clusters ◽

Protein Domain ◽

Parameter Estimates ◽

Bacterial Gene ◽

Wide Range ◽

The Stability

Abstract Background In bacteria, genes with related functions—such as those involved in the metabolism of the same compound or in infection processes—are often physically close on the genome and form groups called clusters. The enrichment of such clusters over various distantly related bacteria can be used to predict the roles of genes of unknown function that cluster with characterised genes. There is no obvious rule to define a cluster, given their variability in size and intergenic distances, and the definition of what comprises a “gene”, since genes can gain and lose domains over time. Protein domains can cluster within a gene, or in adjacent genes of related function, and in both cases these are chromosomally clustered. Here, we model the distances between pairs of protein domain coding regions across a wide range of bacteria and archaea via a probabilistic two component mixture model, without imposing arbitrary thresholds in terms of gene numbers or distances. Results We trained our model using matched gene ontology terms to label functionally related pairs and assess the stability of the parameters of the model across 14,178 archaeal and bacterial strains. We found that the parameters of our mixture model are remarkably stable across bacteria and archaea, except for endosymbionts and obligate intracellular pathogens. Obligate pathogens have smaller genomes, and although they vary, on average do not show noticeably different clustering distances; the main difference in the parameter estimates is that a far greater proportion of the genes sharing ontology terms are clustered. This may reflect that these genomes are enriched for complexes encoded by clustered core housekeeping genes, as a proportion of the total genes. Given the overall stability of the parameter estimates, we then used the mean parameter estimates across the entire dataset to investigate which gene ontology terms are most frequently associated with clustered genes. 
Conclusions Given the stability of the mixture model across species, it may be used to predict bacterial gene clusters that are shared across multiple species, in addition to giving insights into the evolutionary pressures on the chromosomal locations of genes in different species.


PA-GOSUB: a searchable database of model organism protein sequences with their predicted Gene Ontology molecular function and subcellular localization

Nucleic Acids Research ◽

10.1093/nar/gki120 ◽

2004 ◽

Vol 33 (Database issue) ◽

pp. D147-D153 ◽

Cited By ~ 10

Author(s):

P. Lu

Keyword(s):

Gene Ontology ◽

Subcellular Localization ◽

Model Organism ◽

Protein Sequences ◽

Molecular Function ◽

Searchable Database


Establishing a consensus for the hallmarks of cancer based on gene ontology and pathway annotations

BMC Bioinformatics ◽

10.1186/s12859-021-04105-8 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Yi Chen ◽

Fons. J. Verbeek ◽

Katherine Wolstencroft

Keyword(s):

Gene Ontology ◽

Enrichment Analysis ◽

Biological Data ◽

Hallmarks Of Cancer ◽

High Throughput Analysis ◽

Knowledge Resources ◽

Gene Set ◽

Cancer Hallmarks ◽

Starting Point ◽

High Level

Abstract Background The hallmarks of cancer provide a highly cited and well-used conceptual framework for describing the processes involved in cancer cell development and tumourigenesis. However, methods for translating these high-level concepts into data-level associations between hallmarks and genes (for high throughput analysis), vary widely between studies. The examination of different strategies to associate and map cancer hallmarks reveals significant differences, but also consensus. Results Here we present the results of a comparative analysis of cancer hallmark mapping strategies, based on Gene Ontology and biological pathway annotation, from different studies. By analysing the semantic similarity between annotations, and the resulting gene set overlap, we identify emerging consensus knowledge. In addition, we analyse the differences between hallmark and gene set associations using Weighted Gene Co-expression Network Analysis and enrichment analysis. Conclusions Reaching a community-wide consensus on how to identify cancer hallmark activity from research data would enable more systematic data integration and comparison between studies. These results highlight the current state of the consensus and offer a starting point for further convergence. In addition, we show how a lack of consensus can lead to large differences in the biological interpretation of downstream analyses and discuss the challenges of annotating changing and accumulating biological data, using intermediate knowledge resources that are also changing over time.
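Gene set overlap between two mapping strategies can be quantified in several ways. The sketch below uses the Jaccard index, which is an assumption: the abstract does not say which overlap measure the authors used, and the gene identifiers are placeholders.

```python
def jaccard(set_a, set_b):
    """Jaccard index: size of the intersection divided by size of the union."""
    a, b = set(set_a), set(set_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical gene sets assigned to the same hallmark by two strategies.
strategy_a = {"gene1", "gene2", "gene3", "gene4"}
strategy_b = {"gene1", "gene3", "gene5"}

overlap = jaccard(strategy_a, strategy_b)  # 2 shared genes out of 5 in total
```

A value near 1 would indicate that two strategies map a hallmark to nearly the same genes, while a value near 0 would indicate little agreement.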


Proteomic analysis of the pulvinus, a heliotropic tissue, in Glycine max

International Journal of Plant Biology ◽

10.4081/pb.2014.4887 ◽

2014 ◽

Vol 5 (1) ◽

Cited By ~ 3

Author(s):

Hakme Lee ◽

Wesley M. Garrett ◽

Joseph Sullivan ◽

Irwin Forseth ◽

Savithiry S. Natarajan

Keyword(s):

Mass Spectrometry ◽

Gene Ontology ◽

Gel Electrophoresis ◽

Proton Transport ◽

Enrichment Analysis ◽

The Sun ◽

Leaf Movement ◽

Two Dimensional ◽

Two Dimensional Gel Electrophoresis ◽

Leguminous Plants

Certain plant species respond to light, dark, and other environmental factors by leaf movement. Leguminous plants both track and avoid the sun through turgor changes of the pulvinus tissue at the base of leaves. Mechanisms leading to pulvinar turgor flux, particularly knowledge of the proteins involved, are not well-known. In this study we used two-dimensional gel electrophoresis and liquid chromatography-tandem mass spectrometry to separate and identify the proteins located in the soybean pulvinus. A total of 183 spots were separated and 195 proteins from 165 spots were identified and functionally analyzed using single enrichment analysis for gene ontology terms. The most significant terms were related to proton transport. Comparison with guard cell proteomes revealed similar significant processes but a greater number of pulvinus proteins are required for comparable analysis. To our knowledge, this is a novel report on the analysis of proteins found in soybean pulvinus. These findings provide a better understanding of the proteins required for turgor change in the pulvinus.
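Enrichment of a gene ontology term in a protein list is often assessed with a hypergeometric (one-sided Fisher) test. The sketch below shows that calculation under the assumption that such a test applies here; the abstract does not name the test, and the counts in the example are invented.

```python
from math import comb


def hypergeom_enrichment_p(k, n, K, N):
    """P(X >= k): chance of seeing at least k annotated proteins in a sample.

    N: proteins in the background, K: background proteins carrying the term,
    n: proteins in the identified list, k: identified proteins with the term.
    """
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Invented counts: 3 of 4 identified proteins carry a term held by 5 of 20.
p_value = hypergeom_enrichment_p(k=3, n=4, K=5, N=20)
```

In practice the p-values for all tested terms would also be corrected for multiple testing before terms are ranked.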


Basics of Statistics for Clinical Research in Hand Surgery

Revista Iberoamericana de Cirugía de la Mano ◽

10.1055/s-0038-1675587 ◽

2018 ◽

Vol 46 (02) ◽

pp. 150-171 ◽

Cited By ~ 1

Author(s):

Roberto Rosales ◽

Isam Atroshi

Keyword(s):

Clinical Research ◽

Linear Models ◽

Surgical Intervention ◽

Statistical Tests ◽

Hand Surgery ◽

Descriptive Statistics ◽

Data Matrix ◽

P Value ◽

Real Value ◽

Tunnel Syndrome

Abstract: Statistics, the science of numerical evaluation, helps in determining the real value of a hand surgical intervention. Clinical research in hand surgery cannot improve without considering the application of the most appropriate statistical procedures. The purpose of the present paper is to approach the basics of data analysis using a database of carpal tunnel syndrome (CTS) to understand the data matrix, the generation of variables, the descriptive statistics, the most appropriate statistical tests based on how data were collected, the parameter estimation (inference statistics) with p-value or confidence interval, and, finally, the important concept of generalized linear models (GLMs) or regression analysis.


Effects of Pseudorabies Virus Infection on the Tracheobronchial Lymph Node Transcriptome

Bioinformatics and Biology Insights ◽

10.4137/bbi.s30522 ◽

2015 ◽

Vol 9s2 ◽

pp. BBI.S30522 ◽

Cited By ~ 2

Author(s):

Laura C. Miller ◽

Darrell O. Bayles ◽

Eraldo L. Zanella ◽

Kelly M. Lager

Keyword(s):

Pseudorabies Virus ◽

Statistical Tests ◽

Digital Gene Expression ◽

Enrichment Analysis ◽

Gene Set Enrichment Analysis ◽

Post Inoculation ◽

Gene Set Enrichment ◽

Tracheobronchial Lymph Node ◽

Base Sequences ◽

Tracheobronchial Lymph Nodes

This study represents the first swine transcriptome hive plots created from gene set enrichment analysis (GSEA) data and provides a novel insight into the global transcriptome changes occurring in tracheobronchial lymph nodes (TBLN) and spanning the swine genome. RNA isolated from draining TBLN from 5-week-old pigs, either clinically infected with a feral isolate of Pseudorabies virus or uninfected, was interrogated using Illumina Digital Gene Expression Tag Profiling. More than 100 million tag sequences were observed, representing 4,064,189 unique 21-base sequences collected from TBLN at time points 1, 3, 6, and 14 days post-inoculation (dpi). Multidimensional statistical tests were applied to determine the significant changes in tag abundance, and then the tags were annotated. Hive plots were created to visualize the differential expression within the swine transcriptome defined by the Broad Institute's GSEA reference datasets between infected and uninfected animals, allowing us to directly compare different conditions.


EnTAP: Bringing Faster and Smarter Functional Annotation to Non-Model Eukaryotic Transcriptomes

10.1101/307868 ◽

2018 ◽

Cited By ~ 5

Author(s):

Alexander J. Hart ◽

Samuel Ginzburg ◽

Muyang (Sam) Xu ◽

Cera R. Fisher ◽

Nasim Rahmatpour ◽

...

Keyword(s):

Similarity Search ◽

De Novo ◽

Gene Annotation ◽

Enrichment Analysis ◽

Orthologous Gene ◽

Protein Domain ◽

Family Assessment ◽

Ontology Term ◽

Protein Coding ◽

Functional Gene Annotation

Abstract: EnTAP (Eukaryotic Non-Model Transcriptome Annotation Pipeline) was designed to improve the accuracy, speed, and flexibility of functional gene annotation for de novo assembled transcriptomes in non-model eukaryotes. This software package addresses the fragmentation and related assembly issues that result in inflated transcript estimates and poor annotation rates, while focusing primarily on protein-coding transcripts. Following filters applied through assessment of true expression and frame selection, open-source tools are leveraged to functionally annotate the translated proteins. Downstream features include fast similarity search across three repositories, protein domain assignment, orthologous gene family assessment, and Gene Ontology term assignment. The final annotation integrates across multiple databases and selects an optimal assignment from a combination of weighted metrics describing similarity search score, taxonomic relationship, and informativeness. Researchers have the option to include additional filters to identify and remove contaminants, identify associated pathways, and prepare the transcripts for enrichment analysis. This fully featured pipeline is easy to install and configure, and runs significantly faster than comparable annotation packages. EnTAP is optimized to generate extensive functional information for the gene space of organisms with limited or poorly characterized genomic resources.


Exploration of the Proteomic Landscape of Small Extracellular Vesicles in Serum as Biomarkers for Early Detection of Colorectal Neoplasia

Frontiers in Oncology ◽

10.3389/fonc.2021.732743 ◽

2021 ◽

Vol 11 ◽

Author(s):

Li-Chun Chang ◽

Yi-Chiung Hsu ◽

Han-Mo Chiu ◽

Koji Ueda ◽

Ming-Shiang Wu ◽

...

Keyword(s):

Extracellular Vesicles ◽

Normal Control ◽

Early Stage ◽

Colorectal Neoplasia ◽

Enrichment Analysis ◽

Functional Enrichment Analysis ◽

Functional Enrichment ◽

Data Matrix ◽

Advanced Adenoma ◽

Advanced Neoplasia

Background: Patient participation in colorectal cancer (CRC) screening via a stool test and colonoscopy is suboptimal, but participation can be improved by the development of a blood test. However, the suboptimal detection abilities of blood tests for advanced neoplasia, including advanced adenoma (AA) and CRC, limit their application. We aimed to investigate the proteomic landscape of small extracellular vesicles (sEVs) from the serum of patients with colorectal neoplasia and identify specific sEV proteins that could serve as biomarkers for early diagnosis. Materials and Methods: We enrolled 100 patients including 13 healthy subjects, 12 non-AAs, 13 AAs, and 16 stage-I, 15 stage-II, 16 stage-III, and 15 stage-IV CRCs. These patients were classified as normal control, early neoplasia, and advanced neoplasia. The sEV proteome was explored by liquid chromatography-tandem mass spectrometry. Generalized association plots were used to integrate the clustering methods, visualize the data matrix, and analyze the relationship. The specific sEV biomarkers were identified by a decision tree via Orange3 software. Functional enrichment analysis was conducted by using the Ingenuity Pathway Analysis platform. Results: The sEV protein matrix was identified from the serum of 100 patients and contained 3353 proteins, of which 1921 proteins from 98 patients were finally analyzed. Compared with the normal control, subjects with early and advanced neoplasia exhibited a distinct proteomic distribution in the data matrix plot. Six sEV proteins were identified, namely, GCLM, KEL, APOF, CFB, PDE5A, and ATIC, which properly distinguished normal control, early neoplasia, and advanced neoplasia patients from each other. Functional enrichment analysis revealed that APOF+ and CFB+ sEV associated with clathrin-mediated endocytosis signaling and the complement system, which have critical implications for CRC carcinogenesis.
Conclusion: Patients with colorectal neoplasia had a distinct sEV proteome expression pattern in serum compared with those patients who were healthy and did not have neoplasms. Moreover, the six identified specific sEV proteins had the potential to discriminate colorectal neoplasia between early-stage and advanced neoplasia. Collectively, our study provided a six-sEV protein biomarker panel for CRC diagnosis at early or advanced stages. Furthermore, the implication of the sEV proteome in CRC carcinogenesis via specific signaling pathways was explored.


Basing population genetic inferences and models of molecular evolution upon desired stationary distributions of DNA or protein sequences

Philosophical Transactions of the Royal Society B Biological Sciences ◽

10.1098/rstb.2008.0167 ◽

2008 ◽

Vol 363 (1512) ◽

pp. 3931-3939 ◽

Cited By ~ 10

Author(s):

Sang Chul Choi ◽

Benjamin D Redelings ◽

Jeffrey L Thorne

Keyword(s):

Molecular Evolution ◽

Markov Model ◽

Stationary Distribution ◽

Population Genetic ◽

Amino Acid Level ◽

Protein Sequences ◽

Relative Fitness ◽

Evolutionary Models ◽

Effective Population ◽

Stationary Distributions

Models of molecular evolution tend to be overly simplistic caricatures of biology that are prone to assigning high probabilities to biologically implausible DNA or protein sequences. Here, we explore how to construct time-reversible evolutionary models that yield stationary distributions of sequences that match given target distributions. By adopting comparatively realistic target distributions, evolutionary models can be improved. Instead of focusing on estimating parameters, we concentrate on the population genetic implications of these models. Specifically, we obtain estimates of the product of effective population size and relative fitness difference of alleles. The approach is illustrated with two applications to protein-coding DNA. In the first, a codon-based evolutionary model yields a stationary distribution of sequences, which, when the sequences are translated, matches a variable-length Markov model trained on human proteins. In the second, we introduce an insertion–deletion model that describes selectively neutral evolutionary changes to DNA. We then show how to modify the neutral model so that its stationary distribution at the amino acid level can match a profile hidden Markov model, such as the one associated with the Pfam database.
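The idea of a time-reversible model whose stationary distribution matches a chosen target can be shown on a tiny state space. The sketch below uses the standard Metropolis construction with a uniform proposal over four nucleotides; this is a generic stand-in chosen for illustration, not the construction used in the paper, and the target frequencies are invented.

```python
states = ["A", "C", "G", "T"]
# Hypothetical target stationary distribution over single nucleotides.
target = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}


def metropolis_matrix(states, target):
    """Transition matrix of a reversible chain with stationary law `target`."""
    k = len(states)
    P = {s: {} for s in states}
    for i in states:
        off_diagonal = 0.0
        for j in states:
            if i != j:
                # Propose j uniformly, accept with probability min(1, pi_j / pi_i).
                P[i][j] = (1 / (k - 1)) * min(1.0, target[j] / target[i])
                off_diagonal += P[i][j]
        P[i][i] = 1.0 - off_diagonal  # rejected proposals stay in place
    return P


P = metropolis_matrix(states, target)

# One step of the chain started from the target should return the target.
stationary = {j: sum(target[i] * P[i][j] for i in states) for j in states}
```

Reversibility here means the detailed balance condition `target[i] * P[i][j] == target[j] * P[j][i]` holds for every pair of states, which in turn implies that `target` is stationary.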
