DeMaSk: a deep mutational scanning substitution matrix and its use for variant impact prediction

Author(s):  
Daniel Munro ◽  
Mona Singh

Abstract Motivation Accurately predicting the quantitative impact of a substitution on a protein’s molecular function would be a great aid in understanding the effects of observed genetic variants across populations. While this remains a challenging task, new approaches can leverage data from the increasing numbers of comprehensive deep mutational scanning (DMS) studies that systematically mutate proteins and measure fitness. Results We introduce DeMaSk, an intuitive and interpretable method based only upon DMS datasets and sequence homologs that predicts the impact of missense mutations within any protein. DeMaSk first infers a directional amino acid substitution matrix from DMS datasets and then fits a linear model that combines these substitution scores with measures of per-position evolutionary conservation and variant frequency across homologs. Despite its simplicity, DeMaSk has state-of-the-art performance in predicting the impact of amino acid substitutions, and can easily and rapidly be applied to any protein sequence. Availability and implementation https://demask.princeton.edu generates fitness impact predictions and visualizations for any user-submitted protein sequence. Supplementary information Supplementary data are available at Bioinformatics online.
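
The two steps described in this abstract (a directional substitution matrix averaged from DMS scores, then a linear model over substitution score, conservation, and variant frequency) can be sketched roughly as follows. This is an illustrative sketch only, not the DeMaSk implementation: the input format, the function names, and the use of plain ordinary least squares are all assumptions.

```python
from collections import defaultdict


def directional_matrix(records):
    """Mean DMS score for each ordered (wild-type, variant) pair.

    records: iterable of (wt, var, score) tuples (hypothetical input format).
    The matrix is directional: the A->V entry need not equal the V->A entry.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for wt, var, score in records:
        sums[(wt, var)] += score
        counts[(wt, var)] += 1
    return {pair: sums[pair] / counts[pair] for pair in sums}


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting.

    Assumes A is non-singular (a sketch, no rank checks).
    """
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit_linear_model(features, fitness):
    """Ordinary least squares via the normal equations.

    features: rows of [substitution_score, conservation, variant_frequency]
    (assumed feature layout). Returns [intercept, w_sub, w_cons, w_freq].
    """
    X = [[1.0] + [float(v) for v in row] for row in features]
    k = len(X[0])
    XtX = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    Xty = [sum(x[i] * y for x, y in zip(X, fitness)) for i in range(k)]
    return solve(XtX, Xty)
```

A prediction for a new variant would then be the intercept plus the weighted sum of its three feature values.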

2018 ◽  
Vol 35 (14) ◽  
pp. 2492-2494
Author(s):  
Tania Cuppens ◽  
Thomas E Ludwig ◽  
Pascal Trouvé ◽  
Emmanuelle Genin

Abstract Summary When analyzing sequence data, genetic variants are usually considered one by one, without regard to whether they are found in the same individual. However, variant combinations might be key players in some diseases, as variants that are neutral on their own can become deleterious when they occur together. GEMPROT is a new analysis tool that takes a phased VCF file and visualizes the consequences of the genetic variants on the protein. At the level of an individual, the program shows the variants on each of the two protein sequences along with the Pfam functional protein domains. When data on several individuals are available, GEMPROT lists the haplotypes found in the sample and can compare the haplotype distributions between different sub-groups of individuals. By offering a global visualization of the gene with the genetic variants present, GEMPROT makes it possible to better understand the impact of combinations of genetic variants on the protein sequence. Availability and implementation GEMPROT is freely available at https://github.com/TaniaCuppens/GEMPROT. An on-line version is also available at http://med-laennec.univ-brest.fr/GEMPROT/. Supplementary information Supplementary data are available at Bioinformatics online.


2018 ◽  
Author(s):  
Bernardina Scafuri ◽  
Angelo Facchiano ◽  
Anna Marabotti

The prediction of protein stability is an important problem in computational biology. Missense mutations are frequently associated with a change in protein stability, usually leading to destabilization, unfolding and aggregation. However, direct measurement of the effect of mutations on protein stability is often hindered by the large number of mutations that can affect a protein sequence. Predicting the impact of a mutation on stability is therefore of considerable interest for inferring the phenotypic effects associated with a genotypic variation. For this reason, many predictors of the effects of mutations on protein stability have been developed in recent years and are available online as Web servers. In the present work, we applied several tools based on different approaches to predict the stability of three proteins involved in the different forms of the rare disease galactosemia, and we compare their results, describing the problems we faced, the solutions we adopted and the lessons learnt from this case study.


2018 ◽  
Author(s):  
Jeffrey I. Boucher ◽  
Troy W. Whitfield ◽  
Ann Dauphin ◽  
Gily Nachum ◽  
Carl Hollins ◽  
...  

Abstract The evolution of HIV-1 protein sequences should be governed by a combination of factors including nucleotide mutational probabilities, the genetic code, and fitness. The impact of these factors on protein sequence evolution is interdependent, making it challenging to infer the individual contribution of each factor from phylogenetic analyses alone. We investigated the protein sequence evolution of HIV-1 by determining an experimental fitness landscape of all individual amino acid changes in protease. We compared our experimental results to the frequency of protease variants in a publicly available dataset of 32,163 sequenced isolates from drug-naïve individuals. The most common amino acids in sequenced isolates supported robust experimental fitness, indicating that the experimental fitness landscape captured key features of selection acting on protease during viral infections of hosts. Amino acid changes requiring multiple mutations from the likely ancestor were slightly less likely to support robust experimental fitness than single mutations, consistent with the genetic code favoring chemically conservative amino acid changes. Amino acids that were common in sequenced isolates were predominantly accessible by single mutations from the likely protease ancestor. Multiple mutations commonly observed in isolates were accessible by mutational walks with highly fit single mutation intermediates. Our results indicate that the prevalence of multiple base mutations in HIV-1 protease is strongly influenced by mutational sampling.


Diversity ◽  
2021 ◽  
Vol 13 (11) ◽  
pp. 555
Author(s):  
Emily L. Gordon ◽  
Rebecca T. Kimball ◽  
Edward L. Braun

Phylogenomic analyses have revolutionized the study of biodiversity, but they have revealed that estimated tree topologies can depend, at least in part, on the subset of the genome that is analyzed. For example, estimates of trees for avian orders differ if protein-coding or non-coding data are analyzed. The bird tree is a good study system because the historical signal for relationships among orders is very weak, which should permit subtle non-historical signals to be identified, while monophyly of orders is strongly corroborated, allowing identification of strong non-historical signals. Hydrophobic amino acids in mitochondrially-encoded proteins, which are expected to be found in transmembrane helices, have been hypothesized to be associated with non-historical signals. We tested this hypothesis by comparing the evolution of transmembrane helices and extramembrane segments of mitochondrial proteins from 420 bird species, sampled from most avian orders. We estimated amino acid exchangeabilities for both structural environments and assessed the performance of phylogenetic analysis using each data type. We compared those relative exchangeabilities with values calculated using a substitution matrix for transmembrane helices estimated using a variety of nuclear- and mitochondrially-encoded proteins, allowing us to compare the bird-specific mitochondrial models with a general model of transmembrane protein evolution. To complement our amino acid analyses, we examined the impact of protein structure on patterns of nucleotide evolution. Models of transmembrane and extramembrane sequence evolution for amino acids and nucleotides exhibited striking differences, but there was no evidence for strong topological data type effects. However, incorporating protein structure into analyses of mitochondrially-encoded proteins improved model fit. Thus, we believe that considering protein structure will improve analyses of mitogenomic data, both in birds and in other taxa.


2018 ◽  
Vol 35 (14) ◽  
pp. 2418-2426 ◽  
Author(s):  
David Simoncini ◽  
Kam Y J Zhang ◽  
Thomas Schiex ◽  
Sophie Barbe

Abstract Motivation Structure-based Computational Protein Design (CPD) plays a critical role in advancing the field of protein engineering. Using an all-atom energy function, CPD tries to identify amino acid sequences that fold into a target structure and ultimately perform a desired function. Energy functions remain imperfect, however, and injecting relevant information from known structures into the design process should lead to improved designs. Results We introduce Shades, a data-driven CPD method that exploits local structural environments in known protein structures together with energy to guide sequence design, while sampling side-chain and backbone conformations to accommodate mutations. Shades (Structural Homology Algorithm for protein DESign) is based on customized libraries of non-contiguous in-contact amino acid residue motifs. We have tested Shades on a public benchmark of 40 proteins selected from different protein families. When excluding homologous proteins, Shades achieved a protein sequence recovery of 30% and a protein sequence similarity of 46% on average, compared with the PFAM protein family of the target protein. When homologous structures were added, the wild-type sequence recovery rate reached 93%. Availability and implementation Shades source code is available at https://bitbucket.org/satsumaimo/shades as a patch for Rosetta 3.8 with a curated protein structure database and ITEM library creation software. Supplementary information Supplementary data are available at Bioinformatics online.
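
For readers unfamiliar with the sequence recovery metric quoted above, it is commonly computed as the fraction of positions at which the designed sequence matches the native one. The sketch below shows that common definition only; the exact protocol used in the Shades benchmark is not given in the abstract, and the function name is hypothetical.

```python
def sequence_recovery(designed, native):
    """Fraction of aligned positions where the designed residue equals the native one.

    Assumes both sequences are already aligned and of equal, non-zero length.
    """
    if len(designed) != len(native):
        raise ValueError("sequences must have the same length")
    if not native:
        raise ValueError("sequences must not be empty")
    matches = sum(d == n for d, n in zip(designed, native))
    return matches / len(native)
```

For instance, a four-residue design that differs from the native sequence at one position has a recovery of 0.75.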


2022 ◽  
Vol 23 (2) ◽  
pp. 858
Author(s):  
Sali Anies ◽  
Vincent Jallu ◽  
Julien Diharce ◽  
Tarun J. Narwani ◽  
Alexandre G. de Brevern

Integrin αIIbβ3, a glycoprotein complex expressed at the platelet surface, is involved in platelet aggregation and contributes to primary haemostasis. Several integrin αIIbβ3 polymorphisms prevent this aggregation, causing haemorrhagic syndromes such as Glanzmann thrombasthenia (GT). Access to the 3D structure makes it possible to understand the structural effects of polymorphisms related to GT. In a previous analysis using Molecular Dynamics (MD) simulations of the αIIb Calf-1 domain structure, it was observed that GT associated with single amino acid variation affects distant loops, but not the mutated position. In this study, experiments are extended to the Calf-1, Thigh, and Calf-2 domains. Two loops in Calf-2 are unstructured and are therefore modelled expertly using biophysical restraints. Surprisingly, MD revealed the presence of rigid zones in these loops. Detailed analysis with a structural alphabet, the Protein Blocks (PBs), revealed local changes in highly flexible regions. The variant P741R, located at the C-terminus of Calf-1, showed that the presence of Calf-2 did not affect the results obtained with the isolated Calf-1 domain. Simulations for Calf-1 + Calf-2 and Thigh + Calf-1 variant systems are designed to comprehend the impact of five single amino acid variations in these domains. Distant conformational changes are observed, thus highlighting the potential role of allostery in the structural basis of GT.


2021 ◽  
Author(s):  
Kuan Pern Tan ◽  
Tejashree Rajaram Kanitkar ◽  
Kwoh Chee Keong ◽  
M.S. Madhusudhan

Abstract Predicting the functional consequences of single point mutations has relevance to protein function annotation and to clinical analysis/diagnosis. We developed and tested Packpred, which makes use of a multi-body clique statistical potential in combination with a depth-dependent amino acid substitution matrix (FADHM) and positional Shannon entropy to predict the functional consequences of point mutations in proteins. Parameters were trained over a saturation mutagenesis data set of T4-lysozyme (1966 mutations). The method was tested over another saturation mutagenesis data set (CcdB; 1534 mutations) and the Missense3D data set (4099 mutations). The performance of Packpred was compared against those of six other contemporary methods. With MCC values of 0.42, 0.47 and 0.36 on the training and testing data sets respectively, Packpred outperforms all methods in all data sets, with the exception of marginally underperforming relative to FADHM in the CcdB data set. On analyzing the results, we could build meta servers that chose the best performing methods for wild-type amino acids and for wild-type-mutant amino acid pairs. This led to an increase of MCC value of 0.40 and 0.51 for the two meta predictors respectively on the Missense3D data set. We conjecture that it is possible to improve accuracy with better meta predictors as, among the 7 methods compared, at least one method or another is able to correctly predict ∼99% of the data.


2021 ◽  
Vol 8 ◽  
Author(s):  
Kuan Pern Tan ◽  
Tejashree Rajaram Kanitkar ◽  
Chee Keong Kwoh ◽  
Mallur Srivatsan Madhusudhan

Predicting the functional consequences of single point mutations has relevance to protein function annotation and to clinical analysis/diagnosis. We developed and tested Packpred that makes use of a multi-body clique statistical potential in combination with a depth-dependent amino acid substitution matrix (FADHM) and positional Shannon entropy to predict the functional consequences of point mutations in proteins. Parameters were trained over a saturation mutagenesis data set of T4-lysozyme (1,966 mutations). The method was tested over another saturation mutagenesis data set (CcdB; 1,534 mutations) and the Missense3D data set (4,099 mutations). The performance of Packpred was compared against those of six other contemporary methods. With MCC values of 0.42, 0.47, and 0.36 on the training and testing data sets, respectively, Packpred outperforms all methods in all data sets, with the exception of marginally underperforming in comparison to FADHM in the CcdB data set. A meta server analysis was performed that chose best performing methods of wild-type amino acids and for wild-type mutant amino acid pairs. This led to an increase in the MCC value of 0.40 and 0.51 for the two meta predictors, respectively, on the Missense3D data set. We conjecture that it is possible to improve accuracy with better meta predictors as among the seven methods compared, at least one method or another is able to correctly predict ∼99% of the data.
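
Two quantities named in this abstract, positional Shannon entropy and the Matthews correlation coefficient (MCC), have standard definitions that can be sketched as below. This is a generic illustration and not Packpred's code; the function names and the choice of returning 0.0 when the MCC denominator is zero are assumptions.

```python
import math
from collections import Counter


def shannon_entropy(column):
    """Shannon entropy, in bits, of the residues in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator is zero (a common convention, assumed here).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

A fully conserved column has zero entropy, while a column split evenly between two residues has an entropy of one bit; MCC ranges from -1 (total disagreement) to 1 (perfect prediction).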


2019 ◽  
Vol 36 (4) ◽  
pp. 798-810 ◽  
Author(s):  
Jeffrey I Boucher ◽  
Troy W Whitfield ◽  
Ann Dauphin ◽  
Gily Nachum ◽  
Carl Hollins ◽  
...  

Abstract The evolution of HIV-1 protein sequences should be governed by a combination of factors including nucleotide mutational probabilities, the genetic code, and fitness. The impact of these factors on protein sequence evolution is interdependent, making it challenging to infer the individual contribution of each factor from phylogenetic analyses alone. We investigated the protein sequence evolution of HIV-1 by determining an experimental fitness landscape of all individual amino acid changes in protease. We compared our experimental results to the frequency of protease variants in a publicly available data set of 32,163 sequenced isolates from drug-naïve individuals. The most common amino acids in sequenced isolates supported robust experimental fitness, indicating that the experimental fitness landscape captured key features of selection acting on protease during viral infections of hosts. Amino acid changes requiring multiple mutations from the likely ancestor were slightly less likely to support robust experimental fitness than single mutations, consistent with the genetic code favoring chemically conservative amino acid changes. Amino acids that were common in sequenced isolates were predominantly accessible by single mutations from the likely protease ancestor. Multiple mutations commonly observed in isolates were accessible by mutational walks with highly fit single mutation intermediates. Our results indicate that the prevalence of multiple-base mutations in HIV-1 protease is strongly influenced by mutational sampling.
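
The distinction drawn above between amino acid changes reachable by a single nucleotide mutation and those requiring multiple mutations follows from the standard genetic code. The sketch below counts the minimum number of nucleotide changes needed to go from a given codon to any codon for a target amino acid. It is an illustration only: the function name is hypothetical and the codons in the usage note are arbitrary examples, not data from the study.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order ('*' marks stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def min_nucleotide_changes(codon, target_aa):
    """Fewest base substitutions turning `codon` into a codon for `target_aa`."""
    distances = [
        sum(a != b for a, b in zip(codon, other))
        for other, aa in CODON_TABLE.items()
        if aa == target_aa
    ]
    if not distances:
        raise ValueError(f"unknown amino acid: {target_aa!r}")
    return min(distances)
```

For example, `min_nucleotide_changes("GAT", "E")` is 1 (aspartate to glutamate via GAT to GAA), whereas `min_nucleotide_changes("ATG", "W")` is 2, since methionine's only codon differs from tryptophan's only codon (TGG) at two positions.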

