Human Missense Variation is Constrained by Domain Structure and Highlights Functional and Pathogenic Residues

2017 ◽  
Author(s):  
Stuart A. MacGowan ◽  
Fábio Madeira ◽  
Thiago Britto-Borges ◽  
Melanie S. Schmittner ◽  
Christian Cole ◽  
...  

Human genome sequencing has generated population variant datasets containing millions of variants from hundreds of thousands of individuals [1-3]. The datasets show the genomic distribution of genetic variation to be influenced on genic and sub-genic scales by gene essentiality [1,4,5], protein domain architecture [6] and the presence of genomic features such as splice donor/acceptor sites [2]. However, the variant data are still too sparse to provide a comparative picture of genetic variation between individual protein residues in the proteome [1,6]. Here, we overcome this sparsity for ∼25,000 human protein domains in 1,291 domain families by aggregating variants over equivalent positions (columns) in multiple sequence alignments of sequence-similar (paralogous) domains [7,8]. We then compare the resulting variation profiles from the human population to residue conservation across all species [9] and find that the same tertiary structural and functional pressures that affect amino acid conservation during domain evolution constrain missense variant distributions. Thus, depletion of missense variants at a position implies that it is structurally or functionally important. We find such positions are enriched in known disease-associated variants (OR = 2.83, p ≈ 0) while positions that are both missense depleted and evolutionarily conserved are further enriched in disease-associated variants (OR = 1.85, p = 3.3×10^-17) compared to those that are only evolutionarily conserved (OR = 1.29, p = 4.5×10^-19). Unexpectedly, a subset of evolutionarily Unconserved positions are Missense Depleted in human (UMD positions) and these are also enriched in pathogenic variants (OR = 1.74, p = 0.02). UMD positions are further differentiated from other unconserved residues in that they are enriched in ligand, DNA and protein binding interactions (OR = 1.59, p = 0.003), which suggests this stratification can identify functionally important positions. 
A different class of positions that are Conserved and Missense Enriched (CME) show an enrichment of ClinVar risk factor variants (OR = 2.27, p = 0.004). We illustrate these principles with the G-Protein Coupled Receptor (GPCR) family, Nuclear Receptor Ligand Binding Domain family and In Between Ring-Finger (IBR) domains and list a total of 343 UMD positions in 211 domain families. This study will have broad applications to: (a) providing focus for functional studies of specific proteins by mutagenesis; (b) refining pathogenicity prediction models; (c) highlighting which residue interactions to target when refining the specificity of small-molecule drugs.
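The aggregation step described in this abstract can be illustrated with a minimal Python sketch. It assumes a toy input layout that is not taken from the paper: each domain has a gapped aligned sequence and a dictionary of missense variant counts keyed by ungapped residue index. The function names `column_variant_counts` and `odds_ratio` are hypothetical, and the odds ratio is the plain 2x2 cross-product ratio without the significance test reported by the authors.

```python
from collections import Counter


def column_variant_counts(aligned_seqs, variants):
    """Sum missense variant counts over each alignment column.

    aligned_seqs: dict of domain name -> aligned sequence ('-' marks a gap).
    variants: dict of domain name -> {ungapped residue index: variant count}.
    Both layouts are assumptions made for this illustration.
    """
    totals = Counter()
    for name, aligned in aligned_seqs.items():
        residue_index = 0
        for column, symbol in enumerate(aligned):
            if symbol == "-":
                continue  # a gap holds no residue, so no variant maps to it
            totals[column] += variants.get(name, {}).get(residue_index, 0)
            residue_index += 1
    width = max(len(seq) for seq in aligned_seqs.values())
    return [totals[column] for column in range(width)]


def odds_ratio(a, b, c, d):
    """Odds ratio of the 2x2 table [[a, b], [c, d]]."""
    return (a * d) / (b * c)


# Toy example: two paralogous domains aligned over four columns.
aligned = {"dom1": "AC-D", "dom2": "A-GD"}
counts = {"dom1": {0: 2, 2: 1}, "dom2": {1: 3}}
profile = column_variant_counts(aligned, counts)
```

Pooling counts by column is what lets sparse per-residue data from many paralogous domains add up to a usable per-position profile; columns with unusually low totals would be the candidate missense-depleted positions.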

2020 ◽  
Vol 21 (2) ◽  
pp. 513 ◽  
Author(s):  
Marzia Tindara Venuto ◽  
Mathieu Decloquement ◽  
Joan Martorell Ribera ◽  
Maxence Noel ◽  
Alexander Rebl ◽  
...  

We identified and analyzed α2,8-sialyltransferase sequences among 71 ray-finned fish species to provide the first comprehensive view of the Teleost ST8Sia repertoire. This repertoire expanded over the course of Vertebrate evolution and was primarily shaped by the whole-genome duplication events R1 and R2, but not by the Teleost-specific R3. We showed that duplicated st8sia genes like st8sia7, st8sia8, and st8sia9 have disappeared from Tetrapods, whereas their orthologues were maintained in Teleosts. Furthermore, several fish species-specific genome duplications account for the presence of multiple poly-α2,8-sialyltransferases in the Salmonidae (ST8Sia II-r1 and ST8Sia II-r2) and in Cyprinus carpio (ST8Sia IV-r1 and ST8Sia IV-r2). Paralogy and synteny analyses provided more relevant and solid information that enabled us to reconstruct the evolutionary history of st8sia genes in fish genomes. Our data also indicated that, while the mammalian ST8Sia family comprises six subfamilies forming di-, oligo-, or polymers of α2,8-linked sialic acids, the fish ST8Sia family, amounting to a total of 10 genes in fish, appears to be much more diverse and shows a patchy distribution among fish species. A focus on Salmonidae showed that (i) the two copies of st8sia2 genes have overall contrasting tissue-specific expression, with noticeable changes when compared with the human co-orthologue, and that (ii) st8sia4 is weakly expressed. Multiple sequence alignments enabled us to detect changes in the conserved polysialyltransferase domain (PSTD) of the fish sequences that could account for variable enzymatic activities. These data provide the basis for further functional studies using recombinant enzymes.


2020 ◽  
Author(s):  
Aashish Jain ◽  
Genki Terashi ◽  
Yuki Kagaya ◽  
Sai Raghavendra Maddhuri Venkata Subramaniya ◽  
Charles Christoffer ◽  
...  

Abstract. Protein 3D structure prediction has advanced significantly in recent years due to improving contact prediction accuracy. This improvement has been largely due to deep learning approaches that predict inter-residue contacts and, more recently, distances using multiple sequence alignments (MSAs). In this work we present AttentiveDist, a novel approach that uses different MSAs generated with different E-values in a single model to increase the co-evolutionary information provided to the model. To determine the importance of each MSA’s feature at the inter-residue level, we added an attention layer to the deep neural network. The model is trained in a multi-task fashion to also predict backbone and orientation angles further improving the inter-residue distance prediction. We show that AttentiveDist outperforms the top methods for contact prediction in the CASP13 structure prediction competition. To aid in structure modeling we also developed two new deep learning-based sidechain center distance and peptide-bond nitrogen-oxygen distance prediction models. Together these led to a 12% increase in TM-score from the best server method in CASP13 for structure prediction.
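The idea of weighting features from several MSAs with attention can be illustrated, for a single residue pair, with a minimal Python sketch. This is not the AttentiveDist architecture: the scores are supplied by hand here, whereas in a trained network they would be learned, and the names `softmax` and `attend` are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    highest = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def attend(features, scores):
    """Combine per-MSA feature vectors for one residue pair.

    features: one feature vector per MSA (all of equal length).
    scores: one raw attention score per MSA (an assumed input).
    Returns the attention weights and the weighted sum of the vectors.
    """
    weights = softmax(scores)
    dim = len(features[0])
    combined = [
        sum(weight * vector[i] for weight, vector in zip(weights, features))
        for i in range(dim)
    ]
    return weights, combined


# Two MSAs with equal scores contribute equally to the combined feature.
weights, combined = attend([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

Because the weights are computed per residue pair, an MSA that is informative for one pair can be down-weighted for another.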


2021 ◽  
Author(s):  
Konstantin Weissenow ◽  
Michael Heinzinger ◽  
Burkhard Rost

All state-of-the-art (SOTA) protein structure predictions rely on evolutionary information captured in multiple sequence alignments (MSAs), primarily on evolutionary couplings (co-evolution). Such information is not available for all proteins and is computationally expensive to generate. Prediction models based on Artificial Intelligence (AI) using only single sequences as input are easier and cheaper but perform so poorly that speed becomes irrelevant. Here, we described the first competitive AI solution exclusively inputting embeddings extracted from pre-trained protein Language Models (pLMs), namely from the transformer pLM ProtT5, from single sequences into a relatively shallow (few free parameters) convolutional neural network (CNN) trained on inter-residue distances, i.e. protein structure in 2D. The major advance originated from processing the attention heads learned by ProtT5. Although these models required at no point any MSA, they matched the performance of methods relying on co-evolution. Although not reaching the very top, our lean approach came close at substantially lower costs thereby speeding up development and each future prediction. By generating protein-specific rather than family-averaged predictions, these new solutions could distinguish between structural features differentiating members of the same family of proteins with similar structure predicted alike by all other top methods.


2021 ◽  
Vol 2021 ◽  
pp. 1-6
Author(s):  
Somayyeh Hashemian ◽  
Reza Jafarzadeh Esfehani ◽  
Siroos Karimdadi ◽  
Nosrat Ghaemi ◽  
Peyman Eshraghi ◽  
...  

Background. Congenital hyperinsulinism (CHI) is a heterogeneous disease with various underlying genetic causes. Among the genes implicated in the development of CHI, ABCC8, KCNJ11, and HADH are particularly important, especially in a population with a considerable rate of consanguineous marriage. Mutational analysis of these genes guides clinicians to better treatment and prediction of prognosis for this rare disease. The present study aimed to evaluate genetic variants in ABCC8, KCNJ11, and HADH genes as causative genes for CHI in the Iranian population. Methods. The present case series took place in Mashhad, Iran, over 11 years. Every child who had a clinical phenotype and confirmatory biochemical tests of CHI was enrolled in this study. Variants in ABCC8, KCNJ11, and HADH genes were analyzed by the polymerase chain reaction and sequencing in our patients. Results. Of 20 pediatric patients, 16 had variants in ABCC8, KCNJ11, and HADH genes. The mean age of genetic diagnosis was 18.6 days. A homozygous missense (c.2041-21G > A) mutation in the ABCC8 gene was seen in three infants. Other common variants were frameshift variants (c.3438dup) in the ABCC8 gene and a missense variant (c.287-288delinsTG) in the KCNJ11 gene. Most of the variants in our population were still categorized as variants of unknown significance and only 7 pathogenic variants were present. Conclusion. Most variants were located in the ABCC8 gene in our population. Because most of the variants in our population have not been previously reported, further functional studies are warranted.


1999 ◽  
Vol 10 (11) ◽  
pp. 2342-2351 ◽  
Author(s):  
DAVID M. REYNOLDS ◽  
TOMOHITO HAYASHI ◽  
YIQIANG CAI ◽  
BARBERA VELDHUISEN ◽  
TERRY J. WATNICK ◽  
...  

Abstract. It is estimated that approximately 15% of families with autosomal dominant polycystic kidney disease (ADPKD) have mutations in PKD2. Identification of these mutations is central to identifying functionally important regions of the gene and to understanding the mechanisms underlying the pathogenesis of the disorder. The current study describes mutations in six type 2 ADPKD families. Two single base substitution mutations discovered in the ORF in exon 14 constitute the most COOH-terminal pathogenic variants described to date. One of these mutations is a nonsense change and the other encodes an apparent missense variant. Reverse transcription-PCR from patient lymphoblast RNA showed that, in addition, both mutations resulted in out-of-frame splice variants by activating cryptic splice sites via different mechanisms. The apparent missense variant produced such a strong splicing signal that the processed transcript from the mutant chromosome did not contain any of the normally spliced, missense product. A third mutation, a nonconservative missense change affecting a negatively charged residue in the third transmembrane span, is likely pathogenic and defines a highly conserved residue consistent with a potential channel subunit function for polycystin-2. The remaining three mutations included two frame shifts resulting from deletion of one or two bases in exons 6 and 10, respectively, and a nonsense mutation due to a single base substitution in exon 4. The study also defined a novel intragenic polymorphism in exon 1 that will be useful in analyzing “second hits” in PKD2. Finally, the study demonstrates that there are reduced levels of normal polycystin-2 protein in lymphoblast lines from PKD2-affected individuals and that truncated mutant polycystin-2 cannot be detected in patient lymphoblasts, suggesting that the latter may be unstable in at least some tissues. The mutations described will serve as critical reagents for future functional studies in PKD2.


2021 ◽  
Author(s):  
Emidio Capriotti ◽  
Piero Fariselli

Abstract. Evolutionary information is the primary tool for detecting functional conservation in nucleic acids and proteins. This information has been extensively used to predict structure, interactions and functions in macromolecules. Pathogenicity prediction models rely on multiple sequence alignment information at different levels. However, the most accurate genome-wide variant deleteriousness ranking algorithms consider different features to assess the impact of variants. Here, we analyze three different ways of extracting evolutionary information from sequence alignments in the context of pathogenicity predictions at DNA and protein levels. We showed that protein sequence-based information is slightly more informative in the annotation of ClinVar missense variants than that obtained at the DNA level. Furthermore, to achieve the performance of state-of-the-art methods, such as CADD, the conservation of reference and variant, encoded as frequencies of reference/alternate alleles or wild-type/mutant residues, should be included. Our results on a large set of missense variants show that a basic method based on three input features derived from the protein sequence profile performs similarly to the CADD algorithm, which uses hundreds of genomic features. This observation indicates that for missense variants, evolutionary information, when properly encoded, plays the primary role in ranking pathogenicity.
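The encoding of conservation as wild-type and mutant residue frequencies can be sketched in Python as follows. This is a minimal illustration assuming a single alignment column given as a string with '-' for gaps; the function name `residue_frequencies` is hypothetical, and the abstract does not specify the exact three features used, so only the two frequencies it names are computed.

```python
def residue_frequencies(column, wild_type, mutant):
    """Frequencies of the wild-type and mutant residues in one alignment column.

    column: string of residues observed at one alignment position ('-' = gap).
    Gaps are excluded from the denominator, which is an assumption of this
    sketch rather than a choice documented in the abstract.
    """
    residues = [residue for residue in column if residue != "-"]
    if not residues:
        return 0.0, 0.0  # no aligned residues, so no evidence either way
    total = len(residues)
    return residues.count(wild_type) / total, residues.count(mutant) / total


# Column with five aligned residues: A is common, V is rare.
wt_freq, mut_freq = residue_frequencies("AAAG-V", wild_type="A", mutant="V")
```

A high wild-type frequency combined with a low or zero mutant frequency marks a substitution that evolution has rarely tolerated at that position.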


2019 ◽  
Vol 57 (2) ◽  
pp. 138-144 ◽  
Author(s):  
Laurence Hubert ◽  
Magda Cannata Serio ◽  
Laure Villoing-Gaudé ◽  
Nathalie Boddaert ◽  
Anna Kaminska ◽  
...  

Background. Autistic spectrum disorders (ASDs) with developmental delay and seizures are a genetically heterogeneous group of diseases caused by at least 700 different genes. Still, a number of cases remain genetically undiagnosed. Objective. The objective of this study was to identify and characterise pathogenic variants in two individuals from unrelated families, both of whom presented a similar clinical phenotype that included an ASD, intellectual disability (ID) and seizures. Methods. Whole-exome sequencing was used to identify pathogenic variants in the two individuals. Functional studies performed in the Drosophila melanogaster model were used to assess the protein function in vivo. Results. Probands shared a heterozygous de novo secretory carrier membrane protein (SCAMP5) variant (NM_001178111.1:c.538G>T) resulting in a p.Gly180Trp missense variant. SCAMP5 belongs to a family of tetraspanin membrane proteins found in secretory and endocytic compartments of neuronal synapses. In the fly SCAMP orthologue, the p.Gly302Trp genotype corresponds to human p.Gly180Trp. Western blot analysis of proteins overexpressed in the Drosophila fat body showed strongly reduced levels of the SCAMP p.Gly302Trp protein compared with the wild-type protein, indicating that the mutant either reduced expression or increased turnover of the protein. The expression of the fly homologue of the human SCAMP5 p.Gly180Trp mutation caused similar eye and neuronal phenotypes as the expression of SCAMP RNAi, suggesting a dominant-negative effect. Conclusion. Our study identifies SCAMP5 deficiency as a cause for ASD and ID and underscores the importance of synaptic vesicular trafficking in neurodevelopmental disorders.


2021 ◽  
Vol 12 ◽  
Author(s):  
Mehmet Birikmen ◽  
Katherine E. Bohnsack ◽  
Vinh Tran ◽  
Sharvari Somayaji ◽  
Markus T. Bohnsack ◽  
...  

Ribosome assembly is an essential and carefully choreographed cellular process. In eukaryotes, several hundred proteins, distributed across the nucleolus, nucleus, and cytoplasm, co-ordinate the step-wise assembly of four ribosomal RNAs (rRNAs) and approximately 80 ribosomal proteins (RPs) into the mature ribosomal subunits. Due to the inherent complexity of the assembly process, functional studies identifying ribosome biogenesis factors and, more importantly, their precise functions and interplay are confined to a few very well-established model organisms. Although best characterized in yeast (Saccharomyces cerevisiae), emerging links to disease and the discovery of additional layers of regulation have recently encouraged deeper analysis of the pathway in human cells. In archaea, ribosome biogenesis is less well-understood. However, their simpler sub-cellular structure should allow a less elaborate assembly procedure, potentially providing insights into the functional essentials of ribosome biogenesis that evolved long before the diversification of archaea and eukaryotes. Here, we use a comprehensive phylogenetic profiling setup, integrating targeted ortholog searches with automated scoring of protein domain architecture similarities and an assessment of when search sensitivity becomes limiting, to trace 301 curated eukaryotic ribosome biogenesis factors across 982 taxa spanning the tree of life and including 727 archaea. We show that both factor loss and lineage-specific modifications of factor function modulate ribosome biogenesis, and we highlight that limited sensitivity of the ortholog search can confound evolutionary conclusions. Projecting into the archaeal domain, we find that only a few factors are consistently present across the analyzed taxa, and lineage-specific loss is common. 
While members of the Asgard group are not special with respect to their inventory of ribosome biogenesis factors (RBFs), they unite the highest number of orthologs to eukaryotic RBFs in one taxon. Using large ribosomal subunit maturation as an example, we demonstrate that archaea pursue a simplified version of the corresponding steps in eukaryotes. Much of the complexity of this process evolved on the eukaryotic lineage by the duplication of ribosomal proteins and their subsequent functional diversification into ribosome biogenesis factors. This highlights that studying ribosome biogenesis in archaea provides fundamental information also for understanding the process in eukaryotes.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Galya V. Klink ◽  
Hannah O’Keefe ◽  
Amrita Gogna ◽  
Georgii A. Bazykin ◽  
Joanna L. Elson

Abstract. Diseases caused by mutations of mitochondrial DNA (mtDNA) are highly variable in both presentation and penetrance. Over the last 30 years, clinical recognition of this group of diseases has increased. It has been suggested that haplogroup background could influence the penetrance and presentation of disease-causing mutations; however, to date there is only one well-established example of such an effect: the increased penetrance of two Complex I Leber's hereditary optic neuropathy mutations on a haplogroup J background. This paper conducts the most extensive investigation to date into the importance of haplogroup context in the pathogenicity of mtDNA mutations in Complex I. We searched for proven human point mutations across more than 900 metazoans, finding human disease-causing mutations and potential masking variants. We found more than half of human pathogenic variants as compensated pathogenic deviations (CPD) in at least one animal species from our multiple sequence alignments. Some variants were found in many species, and some were even the most prevalent amino acids across our dataset. Variants were also found in other primates, and in such cases, we looked for non-human amino acids in sites with a high probability of interacting with the CPD in the folded protein. Using this “local interactions” approach allowed us to find potential masking substitutions in other amino acid sites. We suggest that the masking variants might arise in humans, resulting in variability of mutation effect in our species.


2018 ◽  
Author(s):  
Ravi Patel ◽  
Sudhir Kumar

Abstract. Background. The evolutionary probability (EP) of an allele in a DNA or protein sequence predicts evolutionarily permissible (ePerm; EP ≥ 0.05) and forbidden (eForb; EP < 0.05) variants. EP of an allele represents an independent evolutionary expectation of observing an allele in a population based solely on the long-term substitution patterns captured in a multiple sequence alignment. In the neutral theory, EP and population frequencies can be compared to identify neutral and non-neutral alleles. This approach has been used to discover candidate adaptive polymorphisms in humans, which are eForbs segregating with high frequencies. The original method to compute EP requires the evolutionary relationships and divergence times of species in the sequence alignment (a timetree), which are not known with certainty for most datasets. This requirement impedes a general use of the original EP formulation. Here, we present an approach in which the phylogeny and times are inferred from the sequence alignment itself prior to the EP calculation. We evaluate if the modified EP approach produces results that are similar to those from the original method. Results. We compared EP estimates from the original and the modified approaches by using more than 18,000 protein sequence alignments containing orthologous sequences from 46 vertebrate species. For the original EP calculations, we used species relationships from UCSC and divergence times from TimeTree web resource, and the resulting EP estimates were considered to be the ground truth. We found that the modified approaches produced reasonable EP estimates for HGMD disease missense variant and 1000 Genomes Project missense variant datasets. Our results showed that reliable estimates of EP can be obtained without a priori knowledge of the sequence phylogeny and divergence times. 
We also found that, in order to obtain robust EP estimates, it is important to assemble a dataset with many sequences, sampling from a diversity of species groups. Conclusion. We conclude that the modified EP approach will be generally applicable for alignments and enable the detection of potentially neutral, deleterious, and adaptive alleles in populations.
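The ePerm/eForb thresholds stated in this abstract translate directly into a small classification rule, sketched below in Python. The EP values are taken as given inputs (the sketch does not compute EP), and the `min_frequency` cutoff for "high frequency" is a hypothetical parameter, since the abstract gives no numeric value for it.

```python
EP_THRESHOLD = 0.05  # ePerm if EP >= 0.05, eForb if EP < 0.05 (from the abstract)


def classify_allele(ep):
    """Label an allele as evolutionarily permissible or forbidden."""
    return "ePerm" if ep >= EP_THRESHOLD else "eForb"


def candidate_adaptive(alleles, min_frequency=0.5):
    """Return names of eForb alleles that segregate at high frequency.

    alleles: dict of allele name -> (EP, population frequency).
    min_frequency: illustrative cutoff only; not taken from the source.
    """
    return [
        name
        for name, (ep, frequency) in alleles.items()
        if classify_allele(ep) == "eForb" and frequency >= min_frequency
    ]


# Hypothetical alleles: (EP, population frequency).
example = {
    "allele_a": (0.30, 0.80),   # permissible and common
    "allele_b": (0.01, 0.70),   # forbidden yet common: candidate adaptive
    "allele_c": (0.01, 0.001),  # forbidden and rare
}
hits = candidate_adaptive(example)
```

The comparison of EP with population frequency is the key step: a common allele is unremarkable when EP is high, but stands out when long-term substitution patterns say it should be rare.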

