Combined alignments of sequences and domains characterize unknown proteins with remotely related protein search PSISearch2D

Database ◽  
2019 ◽  
Vol 2019 ◽  
Author(s):  
Minglei Yang ◽  
Wenliang Zhang ◽  
Guocai Yao ◽  
Haiyue Zhang ◽  
Weizhong Li

Abstract: Iterative homology search has been widely used in the identification of remotely related proteins. Our previous study found that query-seeded iterative sequence search can reduce homologous over-extension errors and greatly improve selectivity. However, iterative homology search remains challenging in protein functional prediction. More sensitive scoring models are needed to improve the predictive performance of alignment methods, and alignment annotation with better visualization has also become imperative for result interpretation. Here we report an open-source application, PSISearch2D, that runs query-seeded iterative sequence search for remotely related protein detection. PSISearch2D retrieves domain annotation from Pfam, UniProtKB, CDD and PROSITE for the resulting hits and presents combined domain and sequence alignments in novel visualizations. A newly defined scoring model, the C-value, re-orders hits by considering the combination of sequence and domain alignments. Benchmarking with the C-value indicates that PSISearch2D outperforms the original PSISearch2 tool in terms of both accuracy and specificity. PSISearch2D improves the characterization of unknown proteins in remote protein detection. Our evaluation tests show that PSISearch2D provided annotation for 77 695 of 139 503 unknown bacterial proteins and 140 751 of 352 757 unknown virus proteins in UniProtKB, about 2.3-fold and 1.8-fold more characterization than the original PSISearch2, respectively. Together with an auto-iteration mode for handling large-scale data and optional programs for global and local sequence alignments, PSISearch2D enhances remotely related protein search.


2021 ◽  
Vol 13 (11) ◽  
pp. 2074
Author(s):  
Ryan R. Reisinger ◽  
Ari S. Friedlaender ◽  
Alexandre N. Zerbini ◽  
Daniel M. Palacios ◽  
Virginia Andrews-Goff ◽  
...  

Machine learning algorithms are often used to model and predict animal habitat selection—the relationships between animal occurrences and habitat characteristics. For broadly distributed species, habitat selection often varies among populations and regions; thus, it would seem preferable to fit region- or population-specific models of habitat selection for more accurate inference and prediction, rather than fitting large-scale models using pooled data. However, where the aim is to make range-wide predictions, including areas for which there are no existing data or models of habitat selection, how can regional models best be combined? We propose that ensemble approaches commonly used to combine different algorithms for a single region can be reframed, treating regional habitat selection models as the candidate models. By doing so, we can incorporate regional variation when fitting predictive models of animal habitat selection across large ranges. We test this approach using satellite telemetry data from 168 humpback whales across five geographic regions in the Southern Ocean. Using random forests, we fitted a large-scale model relating humpback whale locations, versus background locations, to 10 environmental covariates, and made a circumpolar prediction of humpback whale habitat selection. We also fitted five regional models, the predictions of which we used as input features for four ensemble approaches: an unweighted ensemble, an ensemble weighted by environmental similarity in each cell, stacked generalization, and a hybrid approach wherein the environmental covariates and regional predictions were used as input features in a new model. We tested the predictive performance of these approaches on an independent validation dataset of humpback whale sightings and whaling catches. These multiregional ensemble approaches resulted in models with higher predictive performance than the circumpolar naive model. 
These approaches can be used to incorporate regional variation in animal habitat selection when fitting range-wide predictive models using machine learning algorithms. This can yield more accurate predictions across regions or populations of animals that may show variation in habitat selection.
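The two simplest ensemble approaches named above (the unweighted ensemble and the ensemble weighted by environmental similarity in each cell) can be illustrated with a minimal sketch. It assumes each regional model has already produced one habitat-selection prediction per grid cell; the function names, the dictionary layout, and the similarity weights are hypothetical and are not taken from the study.

```python
from typing import Dict, List


def unweighted_ensemble(regional_preds: Dict[str, List[float]]) -> List[float]:
    """Average the regional model predictions cell by cell."""
    models = list(regional_preds.values())
    n_cells = len(models[0])
    return [sum(m[i] for m in models) / len(models) for i in range(n_cells)]


def weighted_ensemble(
    regional_preds: Dict[str, List[float]],
    weights: Dict[str, List[float]],
) -> List[float]:
    """Weighted mean per cell.

    weights[region][i] is an assumed, non-negative measure of how similar
    the environment in cell i is to the environment of that region.
    """
    regions = list(regional_preds)
    n_cells = len(regional_preds[regions[0]])
    combined = []
    for i in range(n_cells):
        total_weight = sum(weights[r][i] for r in regions)
        if total_weight == 0:
            # No region resembles this cell: fall back to the plain mean.
            value = sum(regional_preds[r][i] for r in regions) / len(regions)
        else:
            value = sum(
                weights[r][i] * regional_preds[r][i] for r in regions
            ) / total_weight
        combined.append(value)
    return combined


if __name__ == "__main__":
    # Two hypothetical regions, two grid cells.
    preds = {"region_a": [0.2, 0.8], "region_b": [0.6, 0.4]}
    similarity = {"region_a": [1.0, 0.0], "region_b": [0.0, 1.0]}
    print(unweighted_ensemble(preds))
    print(weighted_ensemble(preds, similarity))
```

Stacked generalization and the hybrid approach would replace these fixed averaging rules with a second model fitted to the regional predictions (and, for the hybrid, the environmental covariates as well); that step is not shown here.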



1996 ◽  
Vol 50 (5) ◽  
pp. 1591-1603 ◽  
Author(s):  
Thierry Massfelder ◽  
Andrew F. Stewart ◽  
Karlhans Endlich ◽  
Neil Soifer ◽  
Clément Judes ◽  
...  


2020 ◽  
Author(s):  
Ramon Viñas ◽  
Tiago Azevedo ◽  
Eric R. Gamazon ◽  
Pietro Liò

Abstract: A question of fundamental biological significance is to what extent the expression of a subset of genes can be used to recover the full transcriptome, with important implications for biological discovery and clinical application. To address this challenge, we present GAIN-GTEx, a method for gene expression imputation based on Generative Adversarial Imputation Networks. In order to increase the applicability of our approach, we leverage data from GTEx v8, a reference resource that has generated a comprehensive collection of transcriptomes from a diverse set of human tissues. We compare our model to several standard and state-of-the-art imputation methods and show that GAIN-GTEx is significantly superior in terms of predictive performance and runtime. Furthermore, our results indicate strong generalisation on RNA-Seq data from 3 cancer types across varying levels of missingness. Our work can facilitate a cost-effective integration of large-scale RNA biorepositories into genomic studies of disease, with high applicability across diverse tissue types.



2021 ◽  
Author(s):  
Hyeyoung Koh ◽  
Hannah Beth Blum

This study presents a machine learning-based approach for sensitivity analysis to examine how parameters affect a given structural response while accounting for uncertainty. Reliability-based sensitivity analysis involves repeated evaluations of the performance function incorporating uncertainties to estimate the influence of a model parameter, which can lead to prohibitive computational costs. This challenge is exacerbated for large-scale engineering problems, which often carry a large number of uncertain parameters. The proposed approach is based on feature selection algorithms that rank feature importance and remove redundant predictors during model development, which improves model generality and training performance by focusing only on the significant features. The approach allows sensitivity analysis of structural systems to be performed by providing feature rankings with reduced computational effort. The proposed approach is demonstrated with two designs of a two-bay, two-story planar steel frame with different failure modes: inelastic instability of a single member and progressive yielding. The feature variables in the data are uncertainties including material yield strength, Young’s modulus, frame sway imperfection, and residual stress. The Monte Carlo sampling method is used to generate random realizations of the frames from published distributions of the feature parameters, and the response variable is the frame ultimate strength obtained from finite element analyses. Decision trees are trained to identify important features. Feature rankings are derived by four feature selection techniques: impurity-based, permutation, SHAP, and Spearman's correlation. The predictive performance of the model including the important features is discussed using an evaluation metric for imbalanced datasets, the Matthews correlation coefficient. 
Finally, the results are compared with those from reliability-based sensitivity analysis on the same example frames to show the validity of the feature selection approach. As the proposed machine learning-based approach produces the same results as the reliability-based sensitivity analysis with improved computational efficiency and accuracy, it could be extended to other structural systems.
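The Matthews correlation coefficient mentioned above has a standard closed form in terms of the binary confusion-matrix counts. The following is a minimal sketch of that standard definition, not the authors' code; the helper names and the convention of returning 0.0 when the denominator vanishes are assumptions made for illustration.

```python
import math
from typing import Sequence, Tuple


def confusion_counts(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[int, int, int, int]:
    """Return (tp, tn, fp, fn) for binary labels coded as 0 and 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """MCC = (tp*tn - fp*fn) / sqrt((tp+fp)(tp+fn)(tn+fp)(tn+fn))."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0.0 when any marginal count is zero.
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    # Hypothetical labels, e.g. 1 = frame fails, 0 = frame survives.
    y_true = [1, 1, 0, 0]
    y_pred = [1, 0, 0, 0]
    print(matthews_corrcoef(*confusion_counts(y_true, y_pred)))
```

Because the numerator and denominator use all four cells of the confusion matrix, the score stays informative when one class is much rarer than the other, which is why it is commonly chosen for imbalanced datasets.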



2017 ◽  
Author(s):  
Vladimir Gligorijević ◽  
Meet Barot ◽  
Richard Bonneau

Abstract: The prevalence of high-throughput experimental methods has resulted in an abundance of large-scale molecular and functional interaction networks. The connectivity of these networks provides a rich source of information for inferring functional annotations for genes and proteins. An important challenge has been to develop methods for combining these heterogeneous networks to extract useful protein feature representations for function prediction. Most of the existing approaches for network integration use shallow models that cannot capture complex and highly nonlinear network structures. Thus, we propose deepNF, a network fusion method based on Multimodal Deep Autoencoders to extract high-level features of proteins from multiple heterogeneous interaction networks. We apply this method to combine STRING networks to construct a common low-dimensional representation containing high-level protein features. We use separate layers for different network types in the early stages of the multimodal autoencoder, later connecting all the layers into a single bottleneck layer from which we extract features to predict protein function. We compare the cross-validation and temporal holdout predictive performance of our method with state-of-the-art methods, including the recently proposed method Mashup. Our results show that our method outperforms previous methods for both human and yeast STRING networks. We also show substantial improvement in the performance of our method in predicting GO terms of varying type and specificity.
Availability: deepNF is freely available at: https://github.com/VGligorijevic/deepNF



2020 ◽  
Author(s):  
Spencer G. Gordon ◽  
Lisa E. Kursel ◽  
Kewei Xu ◽  
Ofer Rog

Abstract: During sexual reproduction the parental homologous chromosomes find each other (pair) and align along their lengths by integrating local sequence homology with large-scale contiguity, thereby allowing for precise exchange of genetic information. The Synaptonemal Complex (SC) is a conserved zipper-like structure that assembles between the homologous chromosomes. This phase-separated interface brings chromosomes together and regulates exchanges between them. However, the molecular mechanisms by which the SC carries out these functions remain poorly understood. Here we isolated and characterized two mutations in the dimerization interface in the middle of the SC zipper in C. elegans. The mutations perturb both chromosome alignment and the regulation of genetic exchanges. Underlying the chromosome-scale phenotypes are distinct alterations to the way SC subunits interact with one another. We propose that the SC brings homologous chromosomes together through two biophysical activities: obligate dimerization that prevents assembly on unpaired chromosomes; and a tendency to phase-separate that extends pairing interactions along the entire length of the chromosomes.



2020 ◽  
Author(s):  
Tobias Groß ◽  
Csaba Jeney ◽  
Darius Halm ◽  
Günter Finkenzeller ◽  
G. Björn Stark ◽  
...  

Abstract: The homogeneity of genetically modified single cells is a necessity for many applications such as cell line development, gene therapy, and tissue engineering, and in particular for regenerative medical applications. The lack of tools to effectively isolate and characterize CRISPR/Cas9 engineered cells is considered a significant bottleneck in these applications. In particular, the inability of protein detection technologies to confirm protein expression changes without prior large-scale clonal expansion creates a gridlock in many applications. To improve the characterization of engineered cells, we propose an improved workflow that includes high-yield single-cell printing/isolation technology based on fluorescent properties, a genomic edit screen (surveyor assay), mRNA rtPCR assessing altered gene expression, and a versatile protein detection tool called emulsion-coupling, to deliver a high-content, unified single-cell workflow. The workflow was exemplified by engineering and functionally validating RANKL knockout immortalized mesenchymal stem cells, which showed altered bone formation capacity. The resulting workflow is economical, does not require large-scale clonal expansion of the cells, and achieves an overall cloning efficiency above 30% of CRISPR/Cas9 edited cells. Moreover, as the single-cell clones are comprehensively characterized at the DNA, RNA, and protein levels in an early, highly parallel phase of cell development, the workflow delivers a higher number of successfully edited cells for further characterization, lowering the chance of late failures in the development process.
Author summary: I completed my undergraduate degree in biochemistry at the University of Ulm and finished my master's degree in pharmaceutical biotechnology at the University of Ulm and University of applied science of Biberach with a focus on biotechnology, toxicology and molecular biology. For my master's thesis, I went to the department of microsystems engineering at the University of Freiburg, where I developed a novel workflow for cell line development. I stayed at the institute for my doctorate, but changed my scientific focus to the development of the emulsion coupling technology, which is a powerful tool for the quantitative and highly parallel measurement of proteins and protein interactions. I am generally interested in being involved in the development of innovative molecular biological methods that can be used to gain new insights into biological questions. I am particularly curious to unravel the complex and often poorly understood protein interaction pathways that are the cornerstone of understanding cellular functionality and are a fundamental necessity to describe life mechanistically.



2020 ◽  
Author(s):  
Hugo Talibart ◽  
François Coste

Abstract
Background: To assign structural and functional annotations to the ever-increasing number of sequenced proteins, the main approach relies on sequence-based homology search methods, e.g. BLAST or the current state-of-the-art methods based on profile Hidden Markov Models (pHMM), which rely on significant alignments of query sequences to annotated proteins or protein families. While powerful, these approaches do not take coevolution between residues into account. Taking advantage of recent advances in the field of contact prediction, we propose here to represent proteins by Potts models, which model direct couplings between positions in addition to positional composition, and to compare proteins by aligning these models. Due to non-local dependencies, the problem of aligning Potts models is hard and remains the main computational bottleneck for their use.
Results: We introduce here an Integer Linear Programming formulation of the problem and PPalign, a program based on this formulation, to compute the optimal pairwise alignment of Potts models representing proteins in tractable time. The approach is assessed with respect to a non-redundant set of reference pairwise sequence alignments from the SISYPHUS benchmark that have the lowest sequence identity (between 3% and 20%) and allow reliable Potts models to be built for each sequence to be aligned. This experimentation confirms that Potts models can be aligned in reasonable time (1′37″ on average on these alignments). The contribution of couplings is evaluated in comparison with HHalign and PPalign without couplings. Although Potts models were not fully optimized for alignment purposes and simple gap scores were used, PPalign yields a better mean F1 score and finds significantly better alignments than HHalign and PPalign without couplings in some cases.
Conclusions: These results show that pairwise couplings from protein Potts models can be used to improve the alignment of remotely related protein sequences in tractable time. Our experimentation nevertheless suggests that new research on the inference of Potts models is now needed to make them more comparable and suitable for homology search. We think that PPalign’s guaranteed optimality will be a powerful asset for performing unbiased investigations in this direction.
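For readers unfamiliar with Potts models, the usual textbook form is given below as a hedged illustration. The abstract does not state the exact parameterization used by PPalign, so the symbols here (fields v_i, couplings w_ij, sequence length L, partition function Z) are assumptions following the standard formulation.

```latex
% Standard Potts model over a sequence x = (x_1, ..., x_L):
% v_i     : positional fields ("positional composition")
% w_{ij}  : pairwise couplings ("direct couplings between positions")
% Z       : normalizing constant (partition function)
P(x_1, \dots, x_L) = \frac{1}{Z}
  \exp\!\left( \sum_{i=1}^{L} v_i(x_i)
             + \sum_{1 \le i < j \le L} w_{ij}(x_i, x_j) \right)
```

With all couplings set to zero, only the positional terms remain, so each position is modelled independently of the others. Any alignment score that compares coupling terms must account for pairs of aligned positions at once, which is the non-local dependency that makes aligning Potts models hard.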



2021 ◽  
Vol 22 (16) ◽  
pp. 8958
Author(s):  
Phasit Charoenkwan ◽  
Chanin Nantasenamat ◽  
Md. Mehedi Hasan ◽  
Mohammad Ali Moni ◽  
Pietro Lio’ ◽  
...  

Accurate identification of bitter peptides is of great importance for better understanding their biochemical and biophysical properties. To date, machine learning-based methods have become an effective avenue for identifying potential bitter peptides from large-scale protein datasets. Although a few machine learning-based predictors have been developed for identifying the bitterness of peptides, their prediction performance could be improved. In this study, we developed a new predictor (named iBitter-Fuse) for more accurate identification of bitter peptides. In iBitter-Fuse, we integrated a variety of feature encoding schemes to provide sufficient information from different aspects, namely compositional information and physicochemical properties. To enhance the predictive performance, a customized genetic algorithm utilizing self-assessment-report (GA-SAR) was employed to identify informative features, and the optimal ones were then input into a support vector machine (SVM)-based classifier to develop the final model (iBitter-Fuse). Benchmarking experiments based on both 10-fold cross-validation and independent tests indicated that iBitter-Fuse achieved more accurate performance than state-of-the-art methods. To facilitate the high-throughput identification of bitter peptides, the iBitter-Fuse web server was established and made freely available online. It is anticipated that iBitter-Fuse will be a useful tool for aiding the discovery and de novo design of bitter peptides.
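As an illustration of what a compositional feature encoding can look like, the sketch below computes amino acid composition: the fraction of each of the 20 standard residues in a peptide. The abstract does not list the specific encodings used in iBitter-Fuse, so this is one common scheme chosen as an assumption, and the function name is hypothetical.

```python
from typing import List

# The 20 standard amino acids in one-letter code, in alphabetical order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(peptide: str) -> List[float]:
    """Return a 20-dimensional vector of residue frequencies.

    Characters outside the 20 standard residues are ignored.
    """
    residues = [aa for aa in peptide.upper() if aa in AMINO_ACIDS]
    if not residues:
        raise ValueError("peptide contains no standard amino acids")
    return [residues.count(aa) / len(residues) for aa in AMINO_ACIDS]


if __name__ == "__main__":
    # Hypothetical tripeptide used only to show the output shape.
    features = amino_acid_composition("AAG")
    print(len(features))
    print(features[AMINO_ACIDS.index("A")])
```

A fixed-length vector of this kind can be concatenated with other encodings, such as physicochemical descriptors, before feature selection and classification.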


