In Silico Identification of Functional Protein Interfaces

Proteins perform many of their biological roles through protein–protein, protein–DNA or protein–ligand interfaces. The identification of the amino acids comprising these interfaces often enhances our understanding of the biological function of the proteins. Many methods for the detection of functional interfaces have been developed, and large-scale analyses have provided assessments of their accuracy. Among them are those that consider the size of the protein interface, its amino acid composition and its physicochemical and geometrical properties. Other methods to this effect use statistical potential functions of pairwise interactions, and evolutionary information. The rationale of the evolutionary approach is that functional and structural constraints impose selective pressure; hence, biologically important interfaces often evolve at a slower pace than do other external regions of the protein. Recently, an algorithm, Rate4Site, and a web-server, ConSurf (http://consurf.tau.ac.il/), for the identification of functional interfaces based on the evolutionary relations among homologous proteins as reflected in phylogenetic trees, were developed in our laboratory. The explicit use of the tree topology and branch lengths makes the method remarkably accurate and sensitive. Here we demonstrate its potency in the identification of the functional interfaces of a hypothetical protein, the structure of which was determined as part of the international structural genomics effort. Finally, we propose to combine complementary procedures, in order to enhance the overall performance of methods for the identification of functional interfaces in proteins.
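
The evolutionary rationale above (functional interfaces change more slowly than other surface regions) can be illustrated with a much simpler proxy than the tree-based Rate4Site algorithm. The sketch below scores each column of a multiple sequence alignment by Shannon entropy, which ignores tree topology and branch lengths entirely. It is an illustrative assumption, not the method used by Rate4Site or ConSurf, and the function name `column_entropies` is hypothetical.

```python
import math
from collections import Counter


def column_entropies(alignment):
    """Return the Shannon entropy (in bits) of each alignment column.

    Low entropy means a conserved column. Gap characters ('-') are ignored.
    This is a tree-free proxy for conservation, NOT the Rate4Site algorithm.
    """
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("alignment must be non-empty with equal-length rows")

    scores = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        if not residues:
            # A column made only of gaps carries no conservation signal.
            scores.append(float("nan"))
            continue
        total = len(residues)
        counts = Counter(residues)
        entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
        scores.append(entropy)
    return scores


# Toy alignment: column 0 is invariant, column 2 is fully variable.
toy_alignment = ["ACD", "ACE", "AGF", "ACG"]
toy_scores = column_entropies(toy_alignment)
```

Columns with the lowest scores would be the first candidates to map onto the protein surface; a tree-aware rate estimate would additionally discount similarity that merely reflects closely related sequences.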

TreeCluster: clustering biological sequences using phylogenetic trees

10.1101/591388 ◽

2019 ◽

Cited By ~ 3

Author(s):

Metin Balaban ◽

Niema Moshiri ◽

Uyen Mai ◽

Siavash Mirarab

Keyword(s):

Phylogenetic Trees ◽

Large Scale ◽

Hiv Transmission ◽

Optimization Problems ◽

Divide And Conquer ◽

Multiple Sequence ◽

Branch Lengths ◽

Computer Scientists ◽

Minimum Number ◽

Microbiome Data

Abstract Clustering homologous sequences based on their similarity is a problem that appears in many bioinformatics applications. The fact that sequences cluster is ultimately the result of their phylogenetic relationships. Despite this observation and the natural ways in which a tree can define clusters, most applications of sequence clustering do not use a phylogenetic tree and instead operate on pairwise sequence distances. Due to advances in large-scale phylogenetic inference, we argue that tree-based clustering is under-utilized. We define a family of optimization problems that, given a (not necessarily ultrametric) tree, return the minimum number of clusters such that all clusters adhere to constraints on their heterogeneity. We study three specific constraints that limit the diameter of each cluster, the sum of its branch lengths, or chains of pairwise distances. These three versions of the problem can be solved in time that increases linearly with the size of the tree, a fact that has been known by computer scientists for two of these three criteria for decades. We implement these algorithms in a tool called TreeCluster, which we test on three applications: OTU picking for microbiome data, HIV transmission clustering, and divide-and-conquer multiple sequence alignment. We show that, by using tree-based distances, TreeCluster generates more internally consistent clusters than alternatives and improves the effectiveness of downstream applications. TreeCluster is available at https://github.com/niemasd/TreeCluster.
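
The diameter-constrained version of the problem can be sketched with a bottom-up greedy pass. The code below is a minimal illustration under stated assumptions, not TreeCluster's implementation: the tree is a hypothetical dictionary mapping each internal node to a list of `(child, branch_length)` pairs, and the function name `cluster_by_max_diameter` is invented for this example. At each node, the child subtree reaching farthest down is cut off for as long as the two deepest remaining subtrees would create a leaf-to-leaf path longer than the threshold.

```python
def cluster_by_max_diameter(children, root, threshold):
    """Greedily split a rooted tree into leaf clusters of bounded diameter.

    children:  dict mapping a node to a list of (child, branch_length) pairs;
               leaves are nodes with no entry (or an empty list).
    threshold: maximum allowed path length between two leaves of a cluster.
    Returns a list of sets of leaf labels.
    """
    # Pre-order traversal; reversing it visits every child before its parent.
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(child for child, _ in children.get(node, []))

    depth = {}   # distance from a node to its farthest still-attached leaf
    cut = set()  # nodes whose edge to the parent has been removed
    for node in reversed(order):
        kids = children.get(node, [])
        if not kids:
            depth[node] = 0.0
            continue
        reach = sorted(((depth[c] + w, c) for c, w in kids), key=lambda x: x[0])
        # While the two deepest subtrees violate the threshold, cut the deepest.
        while len(reach) >= 2 and reach[-1][0] + reach[-2][0] > threshold:
            _, far_child = reach.pop()
            cut.add(far_child)
        depth[node] = reach[-1][0]

    # Each cluster is the set of leaves reachable without crossing a cut edge.
    clusters = []
    for start in [root, *cut]:
        leaves, stack = set(), [start]
        while stack:
            node = stack.pop()
            kids = children.get(node, [])
            if not kids:
                leaves.add(node)
            stack.extend(child for child, _ in kids if child not in cut)
        clusters.append(leaves)
    return clusters


# Toy tree: two tight pairs of leaves separated by one long branch.
toy_tree = {
    "root": [("I1", 0.1), ("I2", 2.0)],
    "I1": [("a", 0.1), ("b", 0.1)],
    "I2": [("c", 0.1), ("d", 0.1)],
}
toy_clusters = cluster_by_max_diameter(toy_tree, "root", threshold=0.5)
```

Each node is visited a constant number of times apart from the per-node sort, so on a binary tree the pass runs in time linear in the size of the tree, in line with the complexity stated in the abstract.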

Faculty Opinions recommendation of Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.1163036.623685 ◽

2009 ◽

Author(s):

Oliver Pybus

Keyword(s):

Phylogenetic Trees ◽

Large Scale ◽

Sequence Alignments

Faculty Opinions recommendation of Robust and accurate prediction of residue-residue interactions across protein interfaces using evolutionary information.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.718397008.793524214 ◽

2016 ◽

Author(s):

Peter Uetz

Keyword(s):

Accurate Prediction ◽

Evolutionary Information ◽

Protein Interfaces

The Aryl Hydrocarbon Receptor Nuclear Translocator (ARNT) Family of Proteins: Transcriptional Modifiers with Multi-Functional Protein Interfaces

Current Molecular Medicine ◽

10.2174/15665240113139990042 ◽

2013 ◽

Vol 13 (7) ◽

pp. 1047-1065 ◽

Cited By ~ 18

Author(s):

M. Labrecque ◽

G. Prefontaine ◽

T. Beischlag

Keyword(s):

Aryl Hydrocarbon Receptor ◽

Functional Protein ◽

Aryl Hydrocarbon ◽

Protein Interfaces

A Phylogenomic Supertree of Birds

Diversity ◽

10.3390/d11070109 ◽

2019 ◽

Vol 11 (7) ◽

pp. 109 ◽

Cited By ~ 17

Author(s):

Rebecca T. Kimball ◽

Carl H. Oliveros ◽

Ning Wang ◽

Noor D. White ◽

F. Keith Barker ◽

...

Keyword(s):

Large Scale ◽

Sequence Data ◽

Bird Species ◽

Divide And Conquer ◽

Clear Understanding ◽

Whole Genome ◽

Efficient Manner ◽

Sequence Capture ◽

Branch Lengths ◽

Supertree Methods

It has long been appreciated that analyses of genomic data (e.g., whole genome sequencing or sequence capture) have the potential to reveal the tree of life, but it remains challenging to move from sequence data to a clear understanding of evolutionary history, in part due to the computational challenges of phylogenetic estimation using genome-scale data. Supertree methods solve that challenge because they facilitate a divide-and-conquer approach for large-scale phylogeny inference by integrating smaller subtrees in a computationally efficient manner. Here, we combined information from sequence capture and whole-genome phylogenies using supertree methods. However, the available phylogenomic trees had limited overlap so we used taxon-rich (but not phylogenomic) megaphylogenies to weave them together. This allowed us to construct a phylogenomic supertree, with support values, that included 707 bird species (~7% of avian species diversity). We estimated branch lengths using mitochondrial sequence data and we used these branch lengths to estimate divergence times. Our time-calibrated supertree supports radiation of all three major avian clades (Palaeognathae, Galloanseres, and Neoaves) near the Cretaceous-Paleogene (K-Pg) boundary. The approach we used will permit the continued addition of taxa to this supertree as new phylogenomic data are published, and it could be applied to other taxa as well.

Predicting the Impact of Describing New Species on Phylogenetic Patterns

Integrative Organismal Biology ◽

10.1093/iob/obz028 ◽

2019 ◽

Vol 1 (1) ◽

Cited By ~ 1

Author(s):

D C Blackburn ◽

G Giribet ◽

D E Soltis ◽

E L Stanley

Keyword(s):

New Species ◽

Phylogenetic Trees ◽

Branch Length ◽

Length Variation ◽

Tree Shape ◽

Branch Lengths ◽

Taxonomic History ◽

Ecological Patterns ◽

The Impact ◽

Incomplete Sampling

Abstract Although our inventory of Earth’s biodiversity remains incomplete, we still require analyses using the Tree of Life to understand evolutionary and ecological patterns. Because incomplete sampling may bias our inferences, we must evaluate how future additions of newly discovered species might impact analyses performed today. We describe an approach that uses taxonomic history and phylogenetic trees to characterize the impact of past species discoveries on phylogenetic knowledge using patterns of branch-length variation, tree shape, and phylogenetic diversity. This provides a framework for assessing the relative completeness of taxonomic knowledge of lineages within a phylogeny. To demonstrate this approach, we use recent large phylogenies for amphibians, reptiles, flowering plants, and invertebrates. Well-known clades exhibit a decline in the mean and range of branch lengths that are added each year as new species are described. With increased taxonomic knowledge over time, deep lineages of well-known clades become known such that most recently described new species are added close to the tips of the tree, reflecting changing tree shape over the course of taxonomic history. The same analyses reveal other clades to be candidates for future discoveries that could dramatically impact our phylogenetic knowledge. Our work reveals that species are often added non-randomly to the phylogeny over multiyear time-scales in a predictable pattern of taxonomic maturation. Our results suggest that we can make informed predictions about how new species will be added across the phylogeny of a given clade, thus providing a framework for accommodating unsampled undescribed species in evolutionary analyses.

Co-Evolution of Intrinsically Disordered Proteins with Folded Partners Witnessed by Evolutionary Couplings

International Journal of Molecular Sciences ◽

10.3390/ijms19113315 ◽

2018 ◽

Vol 19 (11) ◽

pp. 3315 ◽

Cited By ~ 10

Author(s):

Rita Pancsa ◽

Fruzsina Zsolyomi ◽

Peter Tompa

Keyword(s):

Large Scale ◽

Intrinsically Disordered Proteins ◽

Protein Structures ◽

Disordered Proteins ◽

Cellular Interaction ◽

Structural Constraints ◽

Protein Residues ◽

Intrinsically Disordered ◽

Evolutionary Changes ◽

Folded Proteins

Although improved strategies for the detection and analysis of evolutionary couplings (ECs) between protein residues already enable the prediction of protein structures and interactions, they are mostly restricted to conserved and well-folded proteins. Whereas intrinsically disordered proteins (IDPs) are central to cellular interaction networks, due to the lack of strict structural constraints, they undergo faster evolutionary changes than folded domains. This makes the reliable identification and alignment of IDP homologs difficult, which has led to IDPs being omitted from most large-scale residue co-variation analyses. By performing a dedicated analysis of phylogenetically widespread bacterial IDP–partner interactions, here we demonstrate that partner binding imposes constraints on IDP sequences that manifest in detectable interprotein ECs. These ECs were not detected for interactions mediated by short motifs, but rather for those with larger IDP–partner interfaces. Most identified coupled residue pairs reside close (<10 Å) to each other on the interface, with a third of them forming multiple direct atomic contacts. EC-carrying interfaces of IDPs are enriched in negatively charged residues, and the EC residues of both IDPs and partners preferentially reside in helices. Our analysis brings hope that IDP–partner interactions that are difficult to study could soon be successfully dissected through residue co-variation analysis.

GeneRax: A tool for species tree-aware maximum likelihood based gene family tree inference under gene duplication, transfer, and loss

10.1101/779066 ◽

2019 ◽

Cited By ~ 3

Author(s):

Benoit Morel ◽

Alexey M. Kozlov ◽

Alexandros Stamatakis ◽

Gergely J. Szöllősi

Keyword(s):

Maximum Likelihood ◽

Phylogenetic Trees ◽

Large Scale ◽

Simulated Data ◽

Gene Families ◽

Species Tree ◽

Homologous Gene ◽

Sequence Alignments ◽

Full Likelihood ◽

True Tree

Abstract Inferring phylogenetic trees for individual homologous gene families is difficult because alignments are often too short, and thus contain insufficient signal, while substitution models inevitably fail to capture the complexity of the evolutionary processes. To overcome these challenges, species tree-aware methods also leverage information from a putative species tree. However, only a few methods are available that implement a full likelihood framework or account for horizontal gene transfers. Furthermore, these methods often require expensive data pre-processing (e.g., computing bootstrap trees), and rely on approximations and heuristics that limit the degree of tree space exploration. Here we present GeneRax, the first maximum likelihood species tree-aware phylogenetic inference software. It simultaneously accounts for substitutions at the sequence level as well as gene-level events, such as duplication, transfer, and loss, relying on established maximum likelihood optimization algorithms. GeneRax can infer rooted phylogenetic trees for multiple gene families, directly from the per-gene sequence alignments and a rooted, yet undated, species tree. We show that compared to competing tools, on simulated data GeneRax infers trees that are the closest to the true tree in 90% of the simulations in terms of relative Robinson-Foulds distance. On empirical datasets, GeneRax is the fastest among all tested methods when starting from aligned sequences, and it infers trees with the highest likelihood score, based on our model. GeneRax completed tree inferences and reconciliations for 1099 Cyanobacteria families in eight minutes on 512 CPU cores. Thus, its parallelization scheme enables large-scale analyses. GeneRax is available under GNU GPL at https://github.com/BenoitMorel/GeneRax.
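
The relative Robinson-Foulds distance used as the accuracy measure above can be illustrated with a small sketch. The version below is a simplified assumption: it compares rooted trees by their sets of clades (the standard definition works on bipartitions of unrooted trees), represents trees as hypothetical nested tuples of leaf labels, and normalizes by the total number of non-trivial clades in both trees. It is not the code used in the GeneRax evaluation.

```python
def clades(tree):
    """Return (leaf_set, clade_set) for a tree given as nested tuples.

    A leaf is any non-tuple label; every tuple is an internal node whose
    clade is the frozenset of leaf labels below it.
    """
    if not isinstance(tree, tuple):
        return frozenset([tree]), set()
    leaves, found = frozenset(), set()
    for subtree in tree:
        sub_leaves, sub_found = clades(subtree)
        leaves |= sub_leaves
        found |= sub_found
    found.add(leaves)
    return leaves, found


def relative_rf(tree_a, tree_b):
    """Fraction of non-trivial clades found in exactly one of two rooted trees."""
    leaves_a, clades_a = clades(tree_a)
    leaves_b, clades_b = clades(tree_b)
    if leaves_a != leaves_b:
        raise ValueError("both trees must contain the same leaf labels")
    # The root clade (all leaves) is shared by construction, so drop it.
    clades_a.discard(leaves_a)
    clades_b.discard(leaves_b)
    total = len(clades_a) + len(clades_b)
    if total == 0:
        return 0.0
    return len(clades_a ^ clades_b) / total


# Toy comparison: a balanced tree against a ladder-shaped tree.
balanced = (("A", "B"), ("C", "D"))
ladder = ((("A", "B"), "C"), "D")
toy_distance = relative_rf(balanced, ladder)
```

A value of 0 means the two trees contain the same clades, and 1 means they share none beyond the root.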

Endogenous miR-21 and TIMP-1 Regulate Hepatic Injury and Fibrosis by Bile Duct Ligation in Vivo

10.21203/rs.3.rs-338885/v1 ◽

2021 ◽

Author(s):

Chung-Hsin Lee ◽

Yi-Chin Yang ◽

Yi-Wen Hung ◽

Ching-Chang Cheng ◽

Yen-Chung Peng

Keyword(s):

Bile Duct ◽

Large Scale ◽

Bile Duct Ligation ◽

Inflammatory Responses ◽

Histopathological Examination ◽

Sprague Dawley ◽

Functional Protein ◽

Common Bile Duct Ligation ◽

Simvastatin Treatment ◽

Duct Ligation

Abstract TIMP metallopeptidase inhibitor 1 (TIMP-1) has been identified as a multifunctional molecule with divergent functions. It participates in wound healing and regeneration, cell morphology and survival, tumor metastasis, angiogenesis, and inflammatory responses. An imbalance of Matrix Metalloproteinase/TIMP regulation has been implicated in several inflammatory diseases. TIMP-1 could be considered an important regulator in the process of liver fibrosis and bile duct degeneration. Thus, we aimed to determine the role of TIMP-1 in a rat model of Common Bile Duct Ligation (CBDL). Male Sprague-Dawley rats were divided into several groups, including those with/without CBDL surgery and those with/without amiodarone or simvastatin administration. Amiodarone/simvastatin treatment was given at a daily dose of 15 mg/kg and 18 mg/kg by intragastric gavage, beginning 7 days prior to CBDL induction. Two weeks after surgery, the animals in each group were sacrificed and the severity of hepatocyte degeneration was examined by histological morphology. A large-scale array for secretory factors was used to identify key functional proteins after CBDL. The hepatic level of miR-21 was determined through Taqman miRNA analysis. Furthermore, the TIMP-1 level in liver tissue was visualized by histological staining. Liver injury and fibrosis were found in CBDL rats based upon histopathological examination and serum biochemical analysis. Hepatic miR-21 and TIMP-1 were significantly up-regulated in CBDL rats, while being slightly rescued in response to amiodarone or simvastatin treatment. Up-regulation of miR-21 and TIMP-1 may result in the progression of hepatic cirrhosis after bile duct obstruction. Drug intervention for cirrhosis, such as the use of statins, may function via similar mechanisms.

Molecular dynamics simulations for genetic interpretation in protein coding regions: where we are, where to go and when

Briefings in Bioinformatics ◽

10.1093/bib/bbz146 ◽

2019 ◽

Author(s):

Juan J Galano-Frutos ◽

Helena García-Cebollada ◽

Javier Sancho

Keyword(s):

Molecular Dynamics ◽

Amino Acid ◽

Large Scale ◽

Binary Classification ◽

Chemical Properties ◽

Md Simulations ◽

Single Amino Acid ◽

Medical Decision ◽

Evolutionary Information ◽

Protein Variants

Abstract The increasing ease with which massive genetic information can be obtained from patients or healthy individuals has stimulated the development of interpretive bioinformatics tools as aids in clinical practice. Most such tools analyze evolutionary information and simple physical–chemical properties to predict whether replacement of one amino acid residue with another will be tolerated or cause disease. Those approaches achieve up to 80–85% accuracy as binary classifiers (neutral/pathogenic). Because such accuracy is insufficient as a basis for medical decisions and does not appear to be increasing, more precise methods, such as full-atom molecular dynamics (MD) simulations in explicit solvent, are also discussed. Then, to describe the goal of interpreting human genetic variations at large scale through MD simulations, we restrictively refer to all possible protein variants carrying single-amino-acid substitutions arising from single-nucleotide variations as the human variome. We calculate its size and develop a simple model that allows calculating the simulation time needed to have a 0.99 probability of observing unfolding events of any unstable variant. The knowledge of that time enables performing a binary classification of the variants (stable-potentially neutral/unstable-pathogenic). Our model indicates that the human variome cannot be simulated with present computing capabilities. However, if they continue to increase as per Moore’s law, it could be simulated (at 65°C) in only 3 years if we started in 2031. The simulation of individual protein variomes is achievable in short times starting at present. International coordination seems appropriate to embark upon massive MD simulations of protein variants.
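
The idea of a simulation time that gives a 0.99 probability of observing an unfolding event can be illustrated with a generic first-order kinetics sketch. The abstract does not specify the authors' model, so the exponential waiting-time assumption, the function name `time_for_probability`, and the rate units are all hypothetical. Under this assumption the probability of seeing at least one event by time t is 1 - exp(-k t), which gives t = -ln(1 - p) / k.

```python
import math


def time_for_probability(rate, probability=0.99):
    """Time needed to observe at least one event with the given probability.

    Assumes events follow first-order (exponential) kinetics with constant
    rate `rate` (events per unit time); this is an illustrative assumption,
    not the model from the paper. The result is in the inverse units of `rate`.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    if not 0 < probability < 1:
        raise ValueError("probability must lie strictly between 0 and 1")
    return -math.log(1.0 - probability) / rate


# With a hypothetical unfolding rate of 0.5 events per microsecond:
toy_time = time_for_probability(0.5)
```

For p = 0.99 the required time is ln(100), roughly 4.6, times the mean waiting time 1/k, so a variant is classified as stable only after a run several times longer than the slowest unfolding time of interest.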
