Comprehensive characterization of amino acid positions in protein structures reveals molecular effect of missense variants

2020 ◽  
Vol 117 (45) ◽  
pp. 28201-28211
Author(s):  
Sumaiya Iqbal ◽  
Eduardo Pérez-Palma ◽  
Jakob B. Jespersen ◽  
Patrick May ◽  
David Hoksza ◽  
...  

Interpretation of the colossal number of genetic variants identified from sequencing applications is one of the major bottlenecks in clinical genetics, with the inference of the effect of amino acid-substituting missense variations on protein structure and function being especially challenging. Here we characterize the three-dimensional (3D) amino acid positions affected in pathogenic and population variants from 1,330 disease-associated genes using over 14,000 experimentally solved human protein structures. By measuring the statistical burden of variations (i.e., point mutations) from all genes on 40 3D protein features, accounting for the structural, chemical, and functional context of the variations’ positions, we identify features that are generally associated with pathogenic and population missense variants. We then perform the same amino acid-level analysis individually for 24 protein functional classes, which reveals unique characteristics of the positions of the altered amino acids: We observe up to 46% divergence of the class-specific features from the general characteristics obtained by the analysis on all genes, which is consistent with the structural diversity of essential regions across different protein classes. We demonstrate that the function-specific 3D features of the variants match the readouts of mutagenesis experiments for BRCA1 and PTEN, and positively correlate with an independent set of clinically interpreted pathogenic and benign missense variants. Finally, we make our results available through a web server to foster accessibility and downstream research. Our findings represent a crucial step toward translational genetics, from highlighting the impact of mutations on protein structure to rationalizing the variants’ pathogenicity in terms of the perturbed molecular mechanisms.

2019 ◽  
Author(s):  
Sumaiya Iqbal ◽  
Jakob B. Jespersen ◽  
Eduardo Perez-Palma ◽  
Patrick May ◽  
David Hoksza ◽  
...  

Abstract Interpretation of the colossal number of genetic variants identified from sequencing applications is one of the major bottlenecks in clinical genetics, with the inference of the effect of amino acid-substituting missense variants on protein structure and function being especially challenging. Here we evaluated the burden of amino acids affected in pathogenic variants (n=32,923) compared to the variants (n=164,915) from the general population in 1,330 disease-associated genes on forty protein features using over 14,000 experimentally-solved 3D structures. By analyzing the whole gene/variant set jointly, we identified 18 features associated with 3D mutational hotspots that are generally important for protein fitness and stability. Individual analyses performed for twenty-four protein functional classes further revealed 240 characteristics of mutational hotspots in total, including new associations recapitulating the sheer diversity across proteins' essential structural regions. We demonstrated that the function-specific features of variants correspond to the readouts of mutagenesis experiments and positively correlate with clinically-interpreted pathogenic and benign missense variants. Finally, we made our results available through a web server to foster accessibility and downstream research. Our findings represent a crucial step towards translational genetics, from highlighting the impact of mutations on protein structure to rationalizing the pathogenicity of variants in terms of the perturbed molecular mechanisms.
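The burden comparison described in this abstract can be illustrated with a small 2x2 enrichment calculation. The sketch below is not the authors' pipeline: it assumes that the burden on one feature is summarized as an odds ratio with a Woolf (log-normal) confidence interval, and the function name `odds_ratio_ci` is hypothetical. The variant totals come from the abstract; the per-feature counts are invented purely for illustration.

```python
import math


def odds_ratio_ci(path_with, path_total, pop_with, pop_total, z=1.96):
    """Odds ratio and Woolf confidence interval for one protein feature.

    path_with / pop_with: variants whose position carries the feature.
    path_total / pop_total: all pathogenic / population variants.
    """
    a = path_with               # pathogenic, feature present
    b = path_total - path_with  # pathogenic, feature absent
    c = pop_with                # population, feature present
    d = pop_total - pop_with    # population, feature absent
    if min(a, b, c, d) == 0:
        # Haldane-Anscombe correction avoids division by zero.
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    ratio = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(ratio) - z * se)
    upper = math.exp(math.log(ratio) + z * se)
    return ratio, lower, upper


# Totals are from the abstract; the feature counts (12000, 30000) are made up.
ratio, lower, upper = odds_ratio_ci(12000, 32923, 30000, 164915)
print(f"OR={ratio:.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
```

Under this reading, an odds ratio above 1 with an interval that excludes 1 would mark a feature as enriched among pathogenic variants, and a ratio below 1 as enriched among population variants.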


2020 ◽  
Vol 48 (W1) ◽  
pp. W132-W139
Author(s):  
Sumaiya Iqbal ◽  
David Hoksza ◽  
Eduardo Pérez-Palma ◽  
Patrick May ◽  
Jakob B Jespersen ◽  
...  

Abstract Human genome sequencing efforts have greatly expanded, and a plethora of missense variants identified both in patients and in the general population is now publicly accessible. Interpretation of the molecular-level effect of missense variants, however, remains challenging and requires a particular investigation of amino acid substitutions in the context of protein structure and function. Answers to questions like ‘Is a variant perturbing a site involved in key macromolecular interactions and/or cellular signaling?’, or ‘Is a variant changing an amino acid located at the protein core or part of a cluster of known pathogenic mutations in 3D?’ are crucial. Motivated by these needs, we developed MISCAST (missense variant to protein structure analysis web suite; http://miscast.broadinstitute.org/). MISCAST is an interactive and user-friendly web server to visualize and analyze missense variants in protein sequence and structure space. Additionally, a comprehensive set of protein structural and functional features has been aggregated in MISCAST from multiple databases and displayed on structures alongside the variants to provide users with the biological context of the variant location in an integrated platform. We further made the annotated data and protein structures readily downloadable from MISCAST to foster advanced offline analysis of missense variants by a wide biological community.


2020 ◽  
Author(s):  
Martin Schwersensky ◽  
Marianne Rooman ◽  
Fabrizio Pucci

Abstract The question of how natural evolution acts on DNA and protein sequences to ensure mutational robustness and evolvability has been asked for decades without a definitive answer. We tackled this issue through a structurome-scale computational investigation, in which we estimated the change in folding free energy upon all possible single-site mutations introduced in more than 20,000 protein structures. The validity of our results is supported by a very good agreement with experimental mutagenesis data. At the amino acid level, we found the protein surface to be more robust to mutations than the core, in a protein length-dependent manner. About 4% of all mutations were shown to be stabilizing, and a majority of mutations on the surface and in the core to be neutral and destabilizing, respectively. At the nucleobase level, single base substitutions were shown to yield on average less destabilizing amino acid mutations than multiple base substitutions. More precisely, the smallest average destabilization occurs for substitutions of base III in the codon, followed by base I, bases I+III, and base II. This ranking highly anticorrelates with the frequency of codon-anticodon mispairing, and suggests that the standard genetic code is optimized more to limit translation errors than the impact of random mutations. Moreover, the codon usage also appears to be optimized for minimizing the errors at the protein level, especially for surface residues that evolve faster and have therefore been under stronger selection, and for biased codons, suggesting that the codon usage bias also partly aims to optimize protein mutational robustness.
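The ranking of codon positions reported above can be given some intuition with a much cruder proxy than folding free energy. The sketch below does not reproduce the authors' method; it only counts, for each codon position in the standard genetic code, how many single-base substitutions of sense codons leave the encoded amino acid unchanged. The function name `synonymous_fraction_by_position` is hypothetical.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering ('*' marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def synonymous_fraction_by_position():
    """Fraction of single-base substitutions that are synonymous, per codon position."""
    synonymous = [0, 0, 0]
    total = [0, 0, 0]
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue  # only sense codons are mutated
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                total[pos] += 1
                # Mutations to a stop codon are counted as non-synonymous.
                if CODON_TABLE[mutant] == aa:
                    synonymous[pos] += 1
    return [s / t for s, t in zip(synonymous, total)]


fractions = synonymous_fraction_by_position()
for label, frac in zip(("I", "II", "III"), fractions):
    print(f"base {label}: {frac:.3f} of substitutions are synonymous")
```

In this count, base III tolerates the most substitutions, base I a few, and base II none, which matches the order III, I, II given in the abstract for single-base changes. Synonymy is only a rough stand-in for the destabilization measure used in the study.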


2019 ◽  
Author(s):  
Lys Sanz Moreta ◽  
Rute R. da Fonseca

Abstract The visualization of the molecular context of an amino acid mutation in a protein structure is crucial for the assessment of its functional impact and to understand its evolutionary implications. Currently, searches for fast evolving amino acid positions using codon substitution models like those implemented in PAML [1] are done in almost complete proteomes, generating large numbers of candidate proteins that require individual structural analyses. Here we present two Python wrapper scripts as the package Link Your Sites (LYS). The first one i) mines the RCSB database [10] using the BLAST alignment tool to find the best matching homologous sequences, ii) fetches their domain positions by using PROSITE [3,8,9], iii) parses the output of PAML, extracting the positional information of fast-evolving sites and transforming it into the coordinate system of the protein structure, iv) outputs a file per gene with the position correlations to its homologous sequence. The second script uses the output of the first one to generate the protein’s graphical assessment. LYS can therefore generate figures to be used in publication highlighting the positively selected sites mapped on regions that are known to have functional relevance and/or be used to reduce the number of targets that will be further analyzed by providing a list of those for which structural information can be retrieved.
Motivation Automating the search for protein structures to assess the functional impact of sites found to be under positive selection by codeml, implemented in PAML [1]. Building publication-quality figures highlighting the sites on a protein structure model that are within and outside functional domains. Reducing the workload associated with selecting proteins for which a functional assessment of the impact of mutations can be done using a protein structure. This is especially relevant when analyzing almost complete proteomes, which is the case in large comparative genomic studies.
Software LYS scripts are executed in the command line. They automatically search for homologous proteins in the RCSB database [10], determine the functional domain locations, correlate the positions identified by the M8 model [1], and output a data frame that can be used as input by PyMOL [7] to generate a visualization of the results.
Availability The LYS scripts are easy to install and use, and they are available at https://github.com/LysSanzMoreta/LYSAutomaticSearch
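The coordinate transformation in step iii) can be pictured as walking along a pairwise alignment between the analyzed sequence and the sequence of the structure. The sketch below is not taken from the LYS code base; the function `map_sites_to_structure`, its parameters, and the toy alignment are all hypothetical, and it assumes the alignment is supplied as two equal-length gapped strings.

```python
def map_sites_to_structure(aligned_query, aligned_structure, sites, structure_start=1):
    """Map 1-based query positions to structure residue numbers.

    aligned_query / aligned_structure: equal-length strings with '-' for gaps.
    sites: query positions of interest (e.g. sites reported as fast-evolving).
    structure_start: residue number of the first aligned structure residue.
    Sites that fall in a gap of the structure map to None.
    """
    if len(aligned_query) != len(aligned_structure):
        raise ValueError("aligned sequences must have the same length")
    mapping = {}
    query_pos = 0
    structure_pos = structure_start - 1
    for q_char, s_char in zip(aligned_query, aligned_structure):
        if q_char != "-":
            query_pos += 1
        if s_char != "-":
            structure_pos += 1
        if q_char != "-" and s_char != "-":
            mapping[query_pos] = structure_pos
    return {site: mapping.get(site) for site in sites}


# Toy alignment: the structure lacks the first query residue and numbering starts at 10.
result = map_sites_to_structure("MKT-AYLG", "-KTQAY-G", [1, 4, 6, 7], structure_start=10)
print(result)
```

Returning None for unmapped sites keeps them visible in the output, so a caller can report which selected sites have no structural counterpart instead of silently dropping them.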


AMB Express ◽  
2019 ◽  
Vol 9 (1) ◽  
Author(s):  
Neeraja Punde ◽  
Jennifer Kooken ◽  
Dagmar Leary ◽  
Patricia M. Legler ◽  
Evelina Angov

Abstract Codon usage frequency influences protein structure and function. The frequency with which codons are used potentially impacts primary, secondary and tertiary protein structure. Poor expression, loss of function, insolubility, or truncation can result from species-specific differences in codon usage. “Codon harmonization” more closely aligns native codon usage frequencies with those of the expression host, particularly within putative inter-domain segments where slower rates of translation may play a role in protein folding. Heterologous expression of Plasmodium falciparum genes in Escherichia coli has been a challenge due to their AT-rich codon bias and highly repetitive DNA sequences. Here, codon harmonization was applied to the malarial antigen CelTOS (Cell-traversal protein for ookinetes and sporozoites). CelTOS is a highly conserved P. falciparum protein involved in cellular traversal through mosquito and vertebrate host cells. It reversibly refolds after thermal denaturation, making it a desirable malarial vaccine candidate. Protein expressed in E. coli from a codon harmonized sequence of P. falciparum CelTOS (CH-PfCelTOS) was compared with protein expressed from the native codon sequence (N-PfCelTOS) to assess the impact of codon usage on protein expression levels, solubility, yield, stability, structural integrity, recognition with CelTOS-specific mAbs and immunogenicity in mice. While the translated proteins were expected to be identical, the translated products produced from the codon-harmonized sequence differed in helical content and showed a smaller distribution of polypeptides in mass spectra, indicating lower heterogeneity of the codon harmonized version and fewer amino acid misincorporations. Substitutions of hydrophobic-to-hydrophobic amino acids were observed more commonly than any other. CH-PfCelTOS induced significantly higher antibody levels compared with N-PfCelTOS; however, no significant differences in either IFN-γ or IL-4 cellular responses were detected between the two antigens.


2018 ◽  
Vol 55 (10) ◽  
pp. 685-692 ◽  
Author(s):  
Xun Chu ◽  
Minjun Yang ◽  
Zhen-Ju Song ◽  
Yan Dong ◽  
Chong Li ◽  
...  

Background The classical human leucocyte antigen (HLA) genes were the most important genetic determinant for Graves’ disease (GD). The aim of the study was to fine map causal variants of the HLA genes. Methods We applied imputation with a Pan-Asian HLA reference panel to thoroughly investigate the major histocompatibility complex (MHC) associations with GD down to the amino acid level of classical HLA genes in 1468 patients with GD and 1490 controls of Han Chinese. Results The strongest finding across the HLA genes was the association with HLA-DPβ1 position 205 (Pomnibus=2.48×10−33). HLA-DPA1*02:02 was the strongest association among the classical HLA alleles, which was in perfect linkage disequilibrium with HLA-DPα1 residue Met11 (OR=1.90, Pbinary=1.76×10−31). Applying stepwise conditional analysis, we identified that amino acid position 205 in HLA-DPβ1, position 66 and 99 in HLA-B and position 28 in HLA-DRβ1 explain the majority of the MHC association to GD risk. We further evaluated risk of two clinical subtypes of GD, namely the persistent thyroid stimulating hormone receptor antibody-positive (pTRAb+) group and the ‘non-persistent TRAb positive’ (pTRAb−) group after antithyroid drug therapy. We found that HLA-B residues Lys66-Arg69-Val76 could drive pTRAb− GD risk alone, while HLA-DPβ1 position 205, HLA-B position 69 and 199 and HLA-DRβ1 position 28 drive pTRAb+ GD risk. The risk heterogeneity between pTRAb+ and pTRAb− GD might be driven by HLA-DPα1 Met11. Conclusions Four amino acid positions could account for the associations of MHC with GD in Han Chinese. These distinct HLA association patterns indicated the two subtypes have distinct molecular mechanisms of pathogenesis.


Molecules ◽  
2020 ◽  
Vol 25 (4) ◽  
pp. 873 ◽  
Author(s):  
Faiza Rasheed ◽  
Joel Markgren ◽  
Mikael Hedenqvist ◽  
Eva Johansson

Proteins are among the most important molecules on Earth. Their structure and aggregation behavior are key to their functionality in living organisms and in protein-rich products. Innovations, such as increased computer size and power, together with novel simulation tools have improved our understanding of protein structure-function relationships. This review focuses on various proteins present in plants and modeling tools that can be applied to better understand protein structures and their relationship to functionality, with particular emphasis on plant storage proteins. Modeling of plant proteins is increasing, but less than 9% of deposits in the Research Collaboratory for Structural Bioinformatics Protein Data Bank come from plant proteins. Although similar tools are applied as for other proteins, modeling of plant proteins is lagging behind and innovative methods are rarely used. Molecular dynamics and molecular docking are commonly used to evaluate differences in forms or mutants, and the impact on functionality. Modeling tools have also been used to describe the photosynthetic machinery and its electron transfer reactions. Storage proteins, especially the large and intrinsically disordered prolamins and glutelins, have been significantly less well-described using modeling. These proteins aggregate during processing and form large polymers that correlate with functionality. The resulting structure-function relationships are important for processed storage proteins, so modeling and simulation studies, using up-to-date models, algorithms, and computer tools are essential for obtaining a better understanding of these relationships.


F1000Research ◽  
2014 ◽  
Vol 3 ◽  
pp. 217 ◽  
Author(s):  
Sandeep Chakraborty ◽  
Basuthkar J. Rao ◽  
Bjarni Asgeirsson ◽  
Ravindra Venkatramani ◽  
Abhaya M. Dandekar

The remarkable diversity in biological systems is rooted in the ability of the twenty naturally occurring amino acids to perform multifarious catalytic functions by creating unique structural scaffolds known as the active site. Finding such structural motifs within the protein structure is a key aspect of many computational methods. The algorithm for obtaining combinations of motifs of a certain length, although polynomial in complexity, runs in non-trivial computer time. Also, the search space expands considerably if stereochemically equivalent residues are allowed to replace an amino acid in the motif. In the present work, we propose a method to precompile all possible motifs comprising a set (n=4 in this case) of predefined amino acid residues from a protein structure that occur within a specified distance (R) of each other (PREMONITION). PREMONITION rolls a sphere of radius R along the protein fold centered at the C atom of each residue, and all possible motifs are extracted within this sphere. The number of residues that can occur within a sphere centered around a residue is bounded by physical constraints, thus setting an upper limit on the processing times. After such a pre-compilation step, the computational time required for querying a protein structure with multiple motifs is considerably reduced. Previously, we had proposed a computational method to estimate the promiscuity of proteins with known active site residues and 3D structure using a database of known active sites in proteins (CSA) by querying each protein with the active site motif of every other residue. The runtime for such a comparison is reduced from days to hours using the PREMONITION methodology.
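The rolling-sphere precompilation described above can be sketched in a few lines. This is an illustrative reading of the abstract, not the PREMONITION implementation: the function `precompile_motifs`, the residue tuple layout, and the toy coordinates are hypothetical, and membership is decided by distance to the sphere's center rather than by pairwise distances between motif residues.

```python
from itertools import combinations
import math


def precompile_motifs(residues, radius, allowed_types, n=4):
    """Index every n-residue motif found inside a sphere centered on each residue.

    residues: list of (residue_id, residue_type, (x, y, z)) tuples, one point per residue.
    Returns {sorted tuple of residue types: set of frozensets of residue ids}.
    """
    index = {}
    for _, _, center_xyz in residues:
        # Residues of an allowed type lying within the sphere around this center.
        in_sphere = [
            r for r in residues
            if r[1] in allowed_types and math.dist(r[2], center_xyz) <= radius
        ]
        for combo in combinations(in_sphere, n):
            key = tuple(sorted(r[1] for r in combo))
            # A set of frozensets removes duplicates found from overlapping spheres.
            index.setdefault(key, set()).add(frozenset(r[0] for r in combo))
    return index


# Toy structure: four residues close together and one far away.
toy = [
    (1, "SER", (0.0, 0.0, 0.0)),
    (2, "HIS", (1.0, 0.0, 0.0)),
    (3, "ASP", (0.0, 1.0, 0.0)),
    (4, "GLY", (0.0, 0.0, 1.0)),
    (5, "SER", (50.0, 0.0, 0.0)),
]
index = precompile_motifs(toy, radius=2.0, allowed_types={"SER", "HIS", "ASP", "GLY"})
print(index)
```

Once the index is built, querying a motif is a dictionary lookup on its sorted residue types, which is where the reduced query time described in the abstract would come from. Because only a bounded number of residues fit inside one sphere, the number of combinations enumerated per center stays bounded as well.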


2021 ◽  
Vol 18 (1) ◽  
Author(s):  
Yutian Wang ◽  
Weiyang Sun ◽  
Zhenfei Wang ◽  
Menglin Zhao ◽  
Xinghai Zhang ◽  
...  

Abstract Background In 2011, a new influenza virus, named Influenza D Virus (IDV), was isolated from pigs, and then cattle, presenting influenza-like symptoms. IDV is one of the causative agents of Bovine Respiratory Disease (BRD), which causes high morbidity and mortality in feedlot cattle worldwide. To date, the molecular mechanisms of IDV pathogenicity are unknown. Recent IDV outbreaks in cattle, along with serological and genetic evidence of IDV infection in humans, have raised concerns regarding the zoonotic potential of this virus. Influenza virus polymerase is a determining factor of viral pathogenicity to mammals. Methods Here we take a prospective approach to this question by creating a random mutation library of the PB2 subunit of the IDV viral polymerase to test which amino acid point mutations will increase viral polymerase activity, leading to increased pathogenicity of the virus. Results Our work identifies specific sites that could affect polymerase activity in influenza D viruses. For example, two single-site mutations, PB2-D533S and PB2-G603Y, can independently increase polymerase activity. The PB2-D533S mutation alone can increase the polymerase activity by 9.92 times, while the PB2-G603Y mutation increases the activity by 8.22 times. Conclusion Taken together, our findings provide important insight into IDV replication fitness mediated by the PB2 protein, increasing our understanding of IDV replication and pathogenicity and facilitating future studies.


2021 ◽  
Author(s):  
Chris Papadopoulos ◽  
Isabelle Callebaut ◽  
Jean-Christophe Gelly ◽  
Isabelle Hatin ◽  
Olivier Namy ◽  
...  

The noncoding genome plays an important role in de novo gene birth and in the emergence of genetic novelty. Nevertheless, how noncoding sequences' properties could promote the birth of novel genes and shape the evolution and the structural diversity of proteins remains unclear. Therefore, by combining different bioinformatic approaches, we characterized the fold potential diversity of the amino acid sequences encoded by all intergenic ORFs (Open Reading Frames) of S. cerevisiae with the aim of (i) exploring whether the large structural diversity observed in proteomes is already present in noncoding sequences, and (ii) estimating the potential of the noncoding genome to produce novel protein bricks that can either give rise to novel genes or be integrated into pre-existing proteins, thus participating in protein structure diversity and evolution. We showed that amino acid sequences encoded by most yeast intergenic ORFs contain the elementary building blocks of protein structures. Moreover, they encompass the large structural diversity of canonical proteins with, strikingly, the majority predicted as foldable. Then, we investigated the early stages of de novo gene birth by identifying intergenic ORFs with a strong translation signal in ribosome profiling experiments and by reconstructing the ancestral sequences of 70 yeast de novo genes. This enabled us to highlight sequence and structural factors determining de novo gene emergence. Finally, we showed a strong correlation between the fold potential of de novo proteins and that of their ancestral amino acid sequences, reflecting the relationship between the noncoding genome and the protein structure universe.

