Evaluating the relevance of sequence conservation in the prediction of pathogenic missense variants

Author(s):  
Emidio Capriotti ◽  
Piero Fariselli

Abstract Evolutionary information is the primary tool for detecting functional conservation in nucleic acids and proteins. This information has been extensively used to predict structure, interactions and functions in macromolecules. Pathogenicity prediction models rely on multiple sequence alignment information at different levels. However, the most accurate genome-wide variant deleteriousness ranking algorithms consider different features to assess the impact of variants. Here, we analyze three different ways of extracting evolutionary information from sequence alignments in the context of pathogenicity predictions at DNA and protein levels. We show that protein sequence-based information is slightly more informative in the annotation of ClinVar missense variants than that obtained at the DNA level. Furthermore, to achieve the performance of state-of-the-art methods, such as CADD, the conservation of reference and variant, encoded as frequencies of reference/alternate alleles or wild-type/mutant residues, should be included. Our results on a large set of missense variants show that a basic method based on three input features derived from the protein sequence profile performs similarly to the CADD algorithm, which uses hundreds of genomic features. This observation indicates that for missense variants, evolutionary information, when properly encoded, plays the primary role in ranking pathogenicity.
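As a rough illustration of what encoding conservation as wild-type/mutant residue frequencies could look like, the sketch below computes both frequencies for a single alignment column. The abstract does not spell out the three profile features, so the third value here (an entropy-based conservation score) is an assumption, and the function name `profile_features` is hypothetical rather than taken from the authors' method.

```python
import math
from collections import Counter

def profile_features(column, wild_type, mutant):
    """Return (freq_wt, freq_mut, conservation) for one alignment column.

    The entropy-based conservation term is an assumed third feature;
    the abstract only names the wild-type/mutant frequencies.
    """
    residues = [r for r in column if r not in "-."]  # ignore gap characters
    if not residues:
        return 0.0, 0.0, 0.0
    counts = Counter(residues)
    total = len(residues)
    freq_wt = counts[wild_type] / total
    freq_mut = counts[mutant] / total
    # Shannon entropy of the column, scaled by the maximum for 20 amino acids
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    conservation = 1.0 - entropy / math.log2(20)
    return freq_wt, freq_mut, conservation

# Toy column: leucine dominates, and the proline variant is never observed
features = profile_features("LLLLLLIV", wild_type="L", mutant="P")
```

A variant whose mutant residue never appears in a well-conserved column gets a high wild-type frequency and a zero mutant frequency, which is the kind of signal the abstract argues carries most of the ranking power.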

2020 ◽  
Author(s):  
Aashish Jain ◽  
Genki Terashi ◽  
Yuki Kagaya ◽  
Sai Raghavendra Maddhuri Venkata Subramaniya ◽  
Charles Christoffer ◽  
...  

Abstract Protein 3D structure prediction has advanced significantly in recent years due to improving contact prediction accuracy. This improvement has been largely due to deep learning approaches that predict inter-residue contacts and, more recently, distances using multiple sequence alignments (MSAs). In this work we present AttentiveDist, a novel approach that uses different MSAs generated with different E-values in a single model to increase the co-evolutionary information provided to the model. To determine the importance of each MSA’s feature at the inter-residue level, we added an attention layer to the deep neural network. The model is trained in a multi-task fashion to also predict backbone and orientation angles further improving the inter-residue distance prediction. We show that AttentiveDist outperforms the top methods for contact prediction in the CASP13 structure prediction competition. To aid in structure modeling we also developed two new deep learning-based sidechain center distance and peptide-bond nitrogen-oxygen distance prediction models. Together these led to a 12% increase in TM-score from the best server method in CASP13 for structure prediction.
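The idea of weighting features from several MSAs at the level of a single residue pair can be sketched with a plain softmax attention step. This is only an illustration of the general technique, not the AttentiveDist architecture: the function names, the feature vectors and the attention scores are all hypothetical, and in a real network the scores would be learned.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(features, scores):
    """Combine K feature vectors (one per MSA) for one residue pair.

    features: list of K equal-length vectors, one per MSA
    scores:   list of K attention logits (assumed given here)
    """
    weights = softmax(scores)
    dim = len(features[0])
    combined = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return combined, weights

# Two hypothetical MSAs contributing 2-dimensional features for one residue pair
combined, weights = attend([[1.0, 0.0], [0.0, 1.0]], scores=[0.0, 0.0])
```

With equal scores the result is a simple average; a larger score for one MSA shifts the combined feature towards that alignment for this particular residue pair.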


2021 ◽  
Author(s):  
Konstantin Weissenow ◽  
Michael Heinzinger ◽  
Burkhard Rost

All state-of-the-art (SOTA) protein structure predictions rely on evolutionary information captured in multiple sequence alignments (MSAs), primarily on evolutionary couplings (co-evolution). Such information is not available for all proteins and is computationally expensive to generate. Prediction models based on Artificial Intelligence (AI) using only single sequences as input are easier and cheaper but perform so poorly that speed becomes irrelevant. Here, we describe the first competitive AI solution exclusively inputting embeddings extracted from pre-trained protein Language Models (pLMs), namely from the transformer pLM ProtT5, from single sequences into a relatively shallow (few free parameters) convolutional neural network (CNN) trained on inter-residue distances, i.e. protein structure in 2D. The major advance originated from processing the attention heads learned by ProtT5. Although these models at no point required an MSA, they matched the performance of methods relying on co-evolution. Although not reaching the very top, our lean approach came close at substantially lower costs, thereby speeding up development and each future prediction. By generating protein-specific rather than family-averaged predictions, these new solutions could distinguish between structural features differentiating members of the same family of proteins with similar structure predicted alike by all other top methods.


2021 ◽  
Author(s):  
Michael Bernhofer ◽  
Christian Dallago ◽  
Tim Karl ◽  
Venkata Satagopam ◽  
Michael Heinzinger ◽  
...  

Abstract Since 1992, PredictProtein (https://predictprotein.org) has been a one-stop online resource for protein sequence analysis, with its main site hosted at the Luxembourg Centre for Systems Biomedicine (LCSB) and queried monthly by over 3,000 users in 2020. PredictProtein was the first Internet server for protein predictions. It pioneered combining evolutionary information and machine learning. Given a protein sequence as input, the server outputs multiple sequence alignments, predictions of protein structure in 1D and 2D (secondary structure, solvent accessibility, transmembrane segments, disordered regions, protein flexibility, and disulfide bridges) and predictions of protein function (functional effects of sequence variation or point mutations, Gene Ontology (GO) terms, subcellular localization, and protein-, RNA-, and DNA binding). PredictProtein’s infrastructure has moved to the LCSB, increasing throughput; the use of MMseqs2 sequence search reduced runtime five-fold; user interface elements improved usability; and new prediction methods were added. PredictProtein recently included predictions from deep learning embeddings (GO and secondary structure) and a method for the prediction of proteins and residues binding DNA, RNA, or other proteins. PredictProtein.org aspires to provide reliable predictions to computational and experimental biologists alike. All scripts and methods are freely available for offline execution in high-throughput settings. Availability Freely accessible webserver: PredictProtein.org; source and Docker images: github.com/rostlab


2015 ◽  
Vol 13 (05) ◽  
pp. 1550028 ◽  
Author(s):  
Westley Arthur Sherman ◽  
Durga Bhavani Kuchibhatla ◽  
Vachiranee Limviphuvadh ◽  
Sebastian Maurer-Stroh ◽  
Birgit Eisenhaber ◽  
...  

Next-generation sequencing advances are rapidly expanding the number of human mutations to be analyzed for causative roles in genetic disorders. Our Human Protein Mutation Viewer (HPMV) is intended to explore the biomolecular mechanistic significance of non-synonymous human mutations in protein-coding genomic regions. The tool helps to assess whether protein mutations affect the occurrence of sequence-architectural features (globular domains, targeting signals, post-translational modification sites, etc.). As input, HPMV accepts protein mutations as UniProt accessions with mutations (e.g. HGVS nomenclature), genome coordinates, or FASTA sequences. As output, HPMV provides an interactive cartoon showing the mutations in relation to elements of the sequence architecture. A large variety of protein sequence architectural features were selected for their particular relevance to mutation interpretation. Clicking a sequence feature in the cartoon expands a tree view of additional information including multiple sequence alignments of conserved domains and a simple 3D viewer mapping the mutation to known PDB structures, if available. The cartoon is also correlated with a multiple sequence alignment of similar sequences from other organisms. In cases where a mutation is likely to have a straightforward interpretation (e.g. a point mutation disrupting a well-understood targeting signal), this interpretation is suggested. The interactive cartoon can be downloaded as a standalone viewer in Java jar format to be saved and viewed later with only a standard Java runtime environment. The HPMV website is: http://hpmv.bii.a-star.edu.sg/ .


2021 ◽  
Author(s):  
Belen Escobari ◽  
Thomas Borsch ◽  
Taylor S. Quedensley ◽  
Michael Gruenstaeudl

Abstract Premise The genus Gynoxys and relatives form a species-rich lineage of Andean shrubs and trees with low genetic distances within the sunflower subtribe Tussilaginineae. Previous molecular phylogenetic investigations of the Tussilaginineae have included few, if any, representatives of this Gynoxoid group or reconstructed ambiguous patterns of relationships for it. Methods We sequenced complete plastid genomes of 21 species of the Gynoxoid group and related Tussilaginineae and conducted detailed comparisons of the phylogenetic relationships supported by the gene, intron, and intergenic spacer partitions of these genomes. We also evaluated the impact of manual, motif-based adjustments of automatic DNA sequence alignments on phylogenetic tree inference. Results Our results indicate that the inclusion of all plastid genome partitions is needed to infer fully resolved phylogenetic trees of the Gynoxoid group. Whole plastome-based tree inference suggests that the genera Gynoxys and Nordenstamia are polyphyletic and form the core clade of the Gynoxoid group. This clade is sister to a clade of Aequatorium and Paragynoxys and also includes some but not all representatives of Paracalia. Conclusions The concatenation and combined analysis of all plastid genome partitions and the construction of manually curated, motif-based DNA sequence alignments are found to be instrumental in the recovery of strongly supported relationships of the Gynoxoid group. We demonstrate that the correct assessment of homology in genome-level plastid sequence datasets is crucial for subsequent phylogeny reconstruction and that the manual post-processing of multiple sequence alignments improves the reliability of such reconstructions amid low genetic distances between taxa.


2020 ◽  
Author(s):  
Dustin J. Wcisel ◽  
J. Thomas Howard ◽  
Jeffrey A. Yoder ◽  
Alex Dornburg

Abstract Background Advances in next-generation sequencing technologies have reduced the cost of whole transcriptome analyses, allowing characterization of non-model species at unprecedented levels. The rapid pace of transcriptomic sequencing has driven the public accumulation of a wealth of data for phylogenomic analyses; however, a lack of tools aimed towards phylogeneticists to efficiently identify orthologous sequences currently hinders effective harnessing of this resource. Results We introduce TOAST, an open source R software package that can utilize the ortholog searches based on the software Benchmarking Universal Single-Copy Orthologs (BUSCO) to assemble multiple sequence alignments of orthologous loci from transcriptomes for any group of organisms. By streamlining search, query, and alignment, TOAST automates the generation of locus and concatenated alignments, and also presents a series of outputs from which users can not only explore missing data patterns across their alignments, but also reassemble alignments based on user-defined acceptable missing data levels for a given research question. Conclusions TOAST provides a comprehensive set of tools for assembly of sequence alignments of orthologs for comparative transcriptomic and phylogenomic studies. This software empowers easy assembly of public and novel sequences for any target database of candidate orthologs, and fills a critically needed niche for tools that enable quantification and testing of the impact of missing data. As open-source software, TOAST is fully customizable for integration into existing or novel custom informatic pipelines for phylogenomic inference.
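Reassembling alignments under a user-defined missing-data threshold can be illustrated with a small generic sketch. TOAST itself is an R package and none of the names below come from its API; `filter_loci`, `concatenate` and the toy data are hypothetical, and the sketch assumes each locus is already aligned.

```python
def filter_loci(alignments, taxa, max_missing=0.5):
    """Keep loci whose fraction of taxa without a sequence is <= max_missing."""
    kept = {}
    for locus, seqs in alignments.items():
        missing = sum(1 for taxon in taxa if not seqs.get(taxon))
        if missing / len(taxa) <= max_missing:
            kept[locus] = seqs
    return kept

def concatenate(alignments, taxa):
    """Join the kept loci into one supermatrix, padding absent taxa with gaps."""
    parts = {taxon: [] for taxon in taxa}
    for locus in sorted(alignments):
        seqs = alignments[locus]
        # Alignment length of this locus, taken from any sequence that is present
        length = len(next(s for s in seqs.values() if s))
        for taxon in taxa:
            parts[taxon].append(seqs.get(taxon) or "-" * length)
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}

taxa = ["a", "b", "c", "d"]
alignments = {
    "locus1": {"a": "ACGT", "b": "ACGA", "c": "ACGT", "d": "AC-T"},
    "locus2": {"a": "GG", "b": None},  # missing for b, c and d
}
strict = concatenate(filter_loci(alignments, taxa, max_missing=0.5), taxa)
relaxed = concatenate(filter_loci(alignments, taxa, max_missing=0.75), taxa)
```

Under the stricter threshold `locus2` is dropped because three of four taxa lack it; relaxing the threshold keeps it and pads the absent taxa with gap characters.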


2019 ◽  
Author(s):  
Alex Dornburg ◽  
Dustin J. Wcisel ◽  
J. Thomas Howard ◽  
Jeffrey A. Yoder

Abstract Background Advances in next-generation sequencing technologies have reduced the cost of whole transcriptome analyses, allowing characterization of non-model species at unprecedented levels. The rapid pace of transcriptomic sequencing has driven the public accumulation of a wealth of data for phylogenomic analyses; however, a lack of tools aimed towards phylogeneticists to efficiently identify orthologous sequences currently hinders effective harnessing of this resource. Results We introduce TOAST, an open source R software package that can utilize the ortholog searches based on the software Benchmarking Universal Single-Copy Orthologs (BUSCO) to assemble multiple sequence alignments of orthologous loci from transcriptomes for any group of organisms. By streamlining search, query, and alignment, TOAST automates the generation of locus and concatenated alignments, and also presents a series of outputs from which users can not only explore missing data patterns across their alignments, but also reassemble alignments based on user-defined acceptable missing data levels for a given research question. Conclusions TOAST provides a comprehensive set of tools for assembly of sequence alignments of orthologs for comparative transcriptomic and phylogenomic studies. This software empowers easy assembly of public and novel sequences for any target database of candidate orthologs, and fills a critically needed niche for tools that enable quantification and testing of the impact of missing data. As open-source software, TOAST is fully customizable for integration into existing or novel custom informatic pipelines for phylogenomic inference.


2020 ◽  
Vol 48 (18) ◽  
pp. e105-e105 ◽  
Author(s):  
Volodymyr Tsybulskyi ◽  
Mohamed Mounir ◽  
Irmtraud M Meyer

Abstract Interactions between biological entities are key to understanding their potential functional roles. Three fields of research have recently made particular progress: the investigation of trans RNA–RNA and RNA–DNA transcriptome interactions and of trans DNA–DNA genome interactions. We now have both experimental and computational methods for examining these interactions in vivo and on a transcriptome- and genome-wide scale, respectively. Often, key insights can be gained by visually inspecting figures that manage to combine different sources of evidence and quantitative information. We here present R-chie, a web server and R package for visualizing cis and trans RNA–RNA, RNA–DNA and DNA–DNA interactions. For this, we have completely revised and significantly extended an earlier version of R-chie (1) which was initially introduced for visualizing RNA secondary structure features. The new R-chie offers a range of unique features for visualizing cis and trans RNA–RNA, RNA–DNA and DNA–DNA interactions. Particularly noteworthy features include the ability to incorporate evolutionary information, e.g. multiple-sequence alignments, to compare two alternative sets of information and to incorporate detailed, quantitative information. R-chie is readily available via a web server as well as a corresponding R package called R4RNA which can be used to run the software locally.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Edward J. Martin ◽  
Thomas R. Meagher ◽  
Daniel Barker

Abstract Background The use of sound to represent sequence data—sonification—has great potential as an alternative and complement to visual representation, exploiting features of human psychoacoustic intuitions to convey nuance more effectively. We have created five parameter-mapping sonification algorithms that aim to improve knowledge discovery from protein sequences and small protein multiple sequence alignments. For two of these algorithms, we investigated their effectiveness at conveying information. To do this we focussed on subjective assessments of user experience. This entailed a focus group session and survey research by questionnaire of individuals engaged in bioinformatics research. Results For single protein sequences, the success of our sonifications for conveying features was supported by both the survey and focus group findings. For protein multiple sequence alignments, there was limited evidence that the sonifications successfully conveyed information. Additional work is required to identify effective algorithms to render multiple sequence alignment sonification useful to researchers. Feedback from both our survey and focus groups suggests future directions for sonification of multiple alignments: animated visualisation indicating the column in the multiple alignment as the sonification progresses, user control of sequence navigation, and customisation of the sound parameters. Conclusions Sonification approaches undertaken in this work have shown some success in conveying information from protein sequence data. Feedback points out future directions to build on the sonification approaches outlined in this paper. The effectiveness assessment process implemented in this work proved useful, giving detailed feedback and key approaches for improvement based on end-user input. The uptake of similar user experience focussed effectiveness assessments could also help with other areas of bioinformatics, for example in visualisation.


2017 ◽  
Author(s):  
Diego Javier Zea ◽  
Alexander Miguel Monzon ◽  
Gustavo Parisi ◽  
Cristina Marino-Buslje

Abstract Conservation and covariation measures, like other evolutionary analyses, require a high number of distant homologous sequences; therefore, a lot of structural divergence can be expected in such divergent alignments. However, most works linking evolutionary and structural information use a single structure, ignoring the structural variability inside a protein family. That common practice seems unrealistic in the light of this work. In this work we studied how structural divergence affects conservation and covariation estimations. We uncover that, within a protein family, ~51% of multiple sequence alignment columns change their exposed/buried status between structures. Also, ~53% of residue pairs that are in contact in one structure are not in contact in another structure from the same family. We found that residue conservation is not directly related to the relative solvent accessible surface area of a single protein structure. Using information from all the available structures rather than from a single representative structure gives more confidence in the structural interpretation of the evolutionary signals. That is particularly important for diverse multiple sequence alignments, where structures can drastically differ. High covariation scores tend to indicate residue contacts that are conserved in the family and are therefore not suitable for finding protein/conformer-specific contacts. Our results suggest that structural divergence should be considered for a better understanding of protein function, to transfer annotation by homology and to model protein evolution.
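The fraction of alignment columns that switch between exposed and buried can be computed along the following lines. This is a minimal sketch under stated assumptions: relative solvent accessibility (RSA) values are taken as already mapped onto alignment columns, the 0.25 cutoff is an arbitrary illustrative threshold rather than the one used in the study, and `fraction_switching_columns` is a hypothetical name.

```python
def fraction_switching_columns(rsa_by_structure, threshold=0.25):
    """Fraction of comparable columns whose exposed/buried status differs.

    rsa_by_structure: one list per structure, giving the RSA of the residue
    in each alignment column, or None where the structure has a gap.
    Columns covered by fewer than two structures are not comparable.
    """
    n_columns = len(rsa_by_structure[0])
    comparable = 0
    switching = 0
    for col in range(n_columns):
        values = [rsa[col] for rsa in rsa_by_structure if rsa[col] is not None]
        if len(values) < 2:
            continue
        comparable += 1
        states = {value >= threshold for value in values}  # True means exposed
        if len(states) == 2:
            switching += 1
    return switching / comparable if comparable else 0.0

# Three hypothetical structures from one family, four alignment columns
rsa = [
    [0.05, 0.40, 0.10, None],
    [0.30, 0.50, 0.08, 0.60],
    [0.02, 0.45, None, None],
]
fraction = fraction_switching_columns(rsa)
```

In this toy input only the first column changes status, the last column is covered by a single structure and is skipped, so one of three comparable columns switches.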

