scoring matrices
Recently Published Documents


TOTAL DOCUMENTS

72
(FIVE YEARS 16)

H-INDEX

16
(FIVE YEARS 3)

2021 ◽  
Author(s):  
Johannes Linder ◽  
Alyssa La Fleur ◽  
Zibo Chen ◽  
Ajasja Ljubetič ◽  
David Baker ◽  
...  

Abstract: Sequence-based neural networks can learn to make accurate predictions from large biological datasets, but model interpretation remains challenging. Many existing feature attribution methods are optimized for continuous rather than discrete input patterns and assess individual feature importance in isolation, making them ill-suited for interpreting non-linear interactions in molecular sequences. Building on work in computer vision and natural language processing, we developed an approach based on deep generative modeling - Scrambler networks - wherein the most salient sequence positions are identified with learned input masks. Scramblers learn to generate Position-Specific Scoring Matrices (PSSMs) where unimportant nucleotides or residues are ‘scrambled’ by raising their entropy. We apply Scramblers to interpret the effects of genetic variants, uncover non-linear interactions between cis-regulatory elements, explain binding specificity for protein-protein interactions, and identify structural determinants of de novo designed proteins. We show that interpretation based on a generative model allows for efficient attribution across large datasets and results in high-quality explanations, often outperforming state-of-the-art methods.
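The idea of "scrambling" a position by raising its entropy can be illustrated with a small sketch. It is not the Scrambler architecture: the linear blend toward the uniform distribution, the function names, and the mask values are all assumptions chosen for illustration. It only shows how a per-position mask weight changes the Shannon entropy of a PSSM column.

```python
import math

def column_entropy(probs):
    """Shannon entropy (in bits) of one PSSM column of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def scramble_column(probs, mask_weight):
    """Blend a column toward the uniform distribution.

    mask_weight = 1.0 keeps the column unchanged (important position);
    mask_weight = 0.0 replaces it with the uniform distribution (fully scrambled).
    This linear blend is an illustrative stand-in, not the paper's parameterization.
    """
    uniform = 1.0 / len(probs)
    return [mask_weight * p + (1.0 - mask_weight) * uniform for p in probs]

# Hypothetical 3-position nucleotide sequence "ACG" as one-hot columns (order A, C, G, T).
one_hot = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
mask = [1.0, 0.0, 0.5]  # made-up importance weights, one per position

pssm = [scramble_column(col, w) for col, w in zip(one_hot, mask)]
entropies = [column_entropy(col) for col in pssm]
```

In this toy setup the unmasked position keeps zero entropy, the fully scrambled position reaches the 2-bit maximum for a four-letter alphabet, and the half-masked position falls in between.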


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Chintalapati Janaki ◽  
Venkatraman S. Gowri ◽  
Narayanaswamy Srinivasan

Abstract: Genome sequencing projects reveal all the protein sequences encoded in a genome. As a first step, homology detection is employed to obtain clues to the structure and function of these proteins. However, high evolutionary divergence between homologous proteins challenges our ability to detect distant relationships. In the past, an approach involving multiple Position Specific Scoring Matrices (PSSMs) was found to be more effective than traditional single PSSMs. Cascaded search is another successful approach, in which the hits of a search are themselves used as queries to detect more homologues. We propose a protocol, ‘Master Blaster’, which combines the principles adopted in these two approaches to enhance our ability to detect remote homologues even further. The approach was assessed using known relationships available in the SCOP70 database, and the results were compared against those of PSI-BLAST and HHblits, a hidden Markov model-based method. Compared to PSI-BLAST, Master Blaster resulted in a 10% improvement in the detection of cross-superfamily connections, nearly 35% improvement in cross-family connections and more than 80% improvement in intra-family connections. HHblits was observed to be more sensitive than Master Blaster in detecting remote homologues. However, there are true hits from 46 folds for which Master Blaster reported homologues that are not reported by HHblits even with optimal parameters, indicating that using multiple methods that combine different approaches can be more effective for detecting remote homologues. Master Blaster stand-alone code is available for download in the supplementary archive.
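The cascaded-search principle mentioned in the abstract (re-querying with the hits of an earlier search) can be sketched as follows. This is not the Master Blaster code: the `search` function is a hypothetical stand-in that uses shared 3-mers instead of a real profile search, and the sequences and thresholds are made up.

```python
from collections import deque

def shared_kmers(a, b, k=3):
    """Toy similarity: number of distinct k-mers two sequences share."""
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    kmers_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    return len(kmers_a & kmers_b)

def search(query, database, min_shared=2):
    """Hypothetical single search round; stands in for a real profile search."""
    return [name for name, seq in database.items()
            if shared_kmers(query, seq) >= min_shared]

def cascaded_search(seed_name, database, max_rounds=3):
    """Re-query with each new hit until no new hits appear or rounds run out."""
    found = {seed_name}
    frontier = deque([(seed_name, 0)])
    while frontier:
        name, depth = frontier.popleft()
        if depth >= max_rounds:
            continue
        for hit in search(database[name], database):
            if hit not in found:
                found.add(hit)
                frontier.append((hit, depth + 1))
    return found - {seed_name}

# Made-up sequences: "distant" resembles "close" but not the seed itself.
db = {
    "seed":    "MKTAYIAK",
    "close":   "MKTAYLLQ",
    "distant": "GGAYLLQW",
    "other":   "PPPPPPPP",
}
direct = set(search(db["seed"], db)) - {"seed"}
cascaded = cascaded_search("seed", db)
```

Here the direct search from the seed finds only `close`, while the cascade also reaches `distant` through `close`. A real protocol would also need to guard against false positives accumulating across rounds, which this sketch does not attempt.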


2021 ◽  
Vol 8 ◽  
Author(s):  
Rodrigo Ochoa ◽  
Roman A. Laskowski ◽  
Janet M. Thornton ◽  
Pilar Cossio

The prediction of peptide binders to Major Histocompatibility Complex (MHC) class II receptors is of great interest for studying autoimmune diseases and for vaccine development. Most approaches predict the affinities using sequence-based models trained on experimental data and multiple alignments from known peptide substrates. However, detecting activity differences caused by single-point mutations is a challenging task. In this work, we used interactions calculated from simulations to build scoring matrices for quickly estimating binding differences caused by single-point mutations. We modelled a set of 837 peptides bound to an MHC class II allele, and optimized the sampling of the conformations using the Rosetta backrub method by comparing the results to molecular dynamics simulations. From the dynamic trajectories of each complex, we averaged and compared structural observables for each amino acid at each position of the 9-mer peptide core region. With this information, we generated the scoring matrices to predict the sign of the binding differences. We then compared the performance of the best scoring matrix to different computational methodologies that range in computational costs. Overall, the performance in predicting the activity differences caused by single mutated peptides was lower than 60% for all the methods. However, the developed scoring matrix in combination with existing methods reports an increase in the performance, up to 86% with a scoring method that uses molecular dynamics.
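Predicting the sign of a binding difference from a position-by-residue scoring matrix amounts to a lookup and a subtraction. The sketch below assumes that a higher score means more favorable binding; that convention, the function name `mutation_sign`, and all the numbers are illustrative assumptions, not values or conventions taken from the study.

```python
def mutation_sign(matrix, wild_type, mutant):
    """Return +1, -1, or 0 for the predicted direction of a single-point mutation.

    matrix[i][aa] holds a score for amino acid `aa` at core position i.
    Assumption for this sketch: a higher score means more favorable binding.
    """
    if len(wild_type) != len(mutant):
        raise ValueError("peptides must have the same length")
    diffs = [i for i, (a, b) in enumerate(zip(wild_type, mutant)) if a != b]
    if len(diffs) != 1:
        raise ValueError("expected exactly one substituted position")
    pos = diffs[0]
    delta = matrix[pos][mutant[pos]] - matrix[pos][wild_type[pos]]
    return (delta > 0) - (delta < 0)

# Made-up scores for a 3-residue toy core (a real core would have 9 positions
# and one entry per amino acid).
toy_matrix = [
    {"A": 0.2, "L": 0.9},
    {"A": 0.5, "L": 0.5},
    {"A": 0.7, "L": 0.1},
]

gain = mutation_sign(toy_matrix, "AAA", "LAA")     # L scores higher at position 0
neutral = mutation_sign(toy_matrix, "AAA", "ALA")  # equal scores at position 1
loss = mutation_sign(toy_matrix, "AAA", "AAL")     # L scores lower at position 2
```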


Author(s):  
Louis Becquey ◽  
Eric Angel ◽  
Fariza Tahi

Motivation: Applied research in machine learning progresses faster when a clean dataset is available and ready to use. Several datasets have been proposed and released over the years for specific tasks such as image classification, speech recognition and, more recently, protein structure prediction. However, for the fundamental problem of RNA structure prediction, information is spread between several databases depending on the level we are interested in: sequence, secondary structure, 3D structure or interactions with other macromolecules. In order to speed up advances in machine-learning-based approaches for RNA secondary and/or 3D structure prediction, a dataset integrating all this information is required, to avoid spending time on data gathering and cleaning. Results: Here, we propose the first attempt at a standardized and automatically generated dataset dedicated to RNA, combining RNA sequences, homology information (in the form of position-specific scoring matrices) and information derived by annotation of available 3D structures (including secondary structure, canonical and non-canonical interactions and backbone torsion angles). The data are retrieved from the public databases PDB, Rfam and SILVA. The paper describes the procedure to build such a dataset and the RNA structure descriptors we provide. Some statistical descriptions of the resulting dataset are also provided. Availability and implementation: The dataset is updated every month and available online (in flat-text file format) on the EvryRNA software platform (https://evryrna.ibisc.univ-evry.fr/evryrna/rnanet). An efficient parallel pipeline to build the dataset is also provided for easy reproduction or modification. Supplementary information: Supplementary data are available at Bioinformatics online.


2020 ◽  
Vol 29 (11) ◽  
pp. 2150-2163
Author(s):  
Rakesh Trivedi ◽  
Hampapathalu Adimurthy Nagarajaram

2020 ◽  
Vol 117 (41) ◽  
pp. 25464-25475
Author(s):  
Jie Zhou ◽  
Shantao Li ◽  
Kevin K. Leung ◽  
Brian O’Donovan ◽  
James Y. Zou ◽  
...  

Proteolysis is a major posttranslational regulator of biology inside and outside of cells. Broad identification of optimal cleavage sites and natural substrates of proteases is critical for drug discovery and to understand protease biology. Here, we present a method that employs two genetically encoded substrate phage display libraries coupled with next generation sequencing (SPD-NGS) that allows up to 10,000-fold deeper sequence coverage of the typical six- to eight-residue protease cleavage sites compared to state-of-the-art synthetic peptide libraries or proteomics. We applied SPD-NGS to two classes of proteases, the intracellular caspases, and the ectodomains of the sheddases, ADAMs 10 and 17. The first library (Lib 10AA) allowed us to identify 10^4 to 10^5 unique cleavage sites over a 1,000-fold dynamic range of NGS counts and produced consensus and optimal cleavage motifs based on position-specific scoring matrices. A second SPD-NGS library (Lib hP), which displayed virtually the entire human proteome tiled in contiguous 49 amino acid sequences with 25 amino acid overlaps, enabled us to identify candidate human proteome sequences. We identified up to 10^4 natural linear cut sites, depending on the protease, and captured most of the examples previously identified by proteomics and predicted 10- to 100-fold more. Structural bioinformatics was used to facilitate the identification of candidate natural protein substrates. SPD-NGS is rapid, reproducible, simple to perform and analyze, inexpensive, and renewable, with unprecedented depth of coverage for substrate sequences, and is an important tool for protease biologists interested in protease specificity for specific assays and inhibitors and to facilitate identification of natural protein substrates.
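How a position-specific scoring matrix and a consensus motif can be derived from a set of aligned cleavage sites is sketched below. The log-odds formulation with a uniform background and a pseudocount is one common way to build a PSSM and is an assumption here, not necessarily the procedure used in the study; the four-residue sites and the reduced alphabet are toy data.

```python
import math
from collections import Counter

def build_pssm(sites, alphabet, pseudocount=1.0):
    """Log2-odds PSSM from equal-length aligned sequences, uniform background."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    background = 1.0 / len(alphabet)
    total = len(sites) + pseudocount * len(alphabet)
    pssm = []
    for i in range(length):
        counts = Counter(s[i] for s in sites)  # missing residues count as 0
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in alphabet
        })
    return pssm

def score(pssm, seq):
    """Sum of per-position log-odds for a candidate site."""
    return sum(col[aa] for col, aa in zip(pssm, seq))

def consensus(pssm):
    """Highest-scoring residue at each position."""
    return "".join(max(col, key=col.get) for col in pssm)

# Toy aligned sites over a reduced alphabet; not data from the study.
sites = ["DEVD", "DEVD", "DETD", "DQVD"]
alphabet = "DEVTQ"
pssm = build_pssm(sites, alphabet)
motif = consensus(pssm)
```

With sequencing data, each site could also be weighted by its read count rather than counted once; that choice is left out of the sketch.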


2020 ◽  
Vol 20 (21) ◽  
pp. 1888-1897
Author(s):  
Jian Zhang ◽  
Yu Zhang ◽  
Yanlin Li ◽  
Song Guo ◽  
Guifu Yang

Objective: Cancer is one of the most serious diseases affecting human health. Among all current cancer treatments, early diagnosis and control significantly help increase the chances of cure. Detecting cancer biomarkers in body fluids is now attracting more attention among oncologists. In-silico predictions of body fluid-related proteins, which can serve as cancer biomarkers, open a door for labor-intensive and time-consuming biochemical experiments. Methods: In this work, we propose a novel method for high-throughput identification of cancer biomarkers in human body fluids. We incorporate physicochemical properties into the weighted observed percentages (WOP) and position-specific scoring matrices (PSSM) profiles to enhance their attributes that reflect the evolutionary conservation of the body fluid-related proteins. The least absolute shrinkage and selection operator (LASSO) feature selection strategy is introduced to generate the optimal feature subset. Results: The ten-fold cross-validation results on training datasets demonstrate the accuracy of the proposed model. We also test our proposed method on independent testing datasets and apply it to the identification of potential cancer biomarkers in human body fluids. Conclusion: The testing results promise a good generalization capability of our approach.


2020 ◽  
Vol 3 (3) ◽  
pp. p47
Author(s):  
DJELLE Opely Patrice-Aime

This study examines the link between cyberdependency and school performance among students in the 3rd grade of the Mamie Houphouët Fêtai High School in Bingerville. It covers a sample of one hundred and ninety (190) female students between the ages of 14 and 17. Students’ addiction to the Internet and social networks is measured using a questionnaire based on Vavassori et al. (2002) and Young’s Internet Addiction Test in its French version validated by Khazaal (2008). Academic performance is assessed using the end-of-term scoring matrices. The results, obtained using Student’s t-test and ANOVA, show that third-grade students who use the Internet as a teaching tool have higher academic performance than their peers who use it for entertainment. These results are explained by the models of Zuckerman (1969) and Viau (1994). Ultimately, this study will inform and raise awareness among students, educational system actors and parents about the risks of excessive use of the Internet and social networks on school learning.


2020 ◽  
Vol 58 (4) ◽  
Author(s):  
Marilynn A. Larson ◽  
Khalid Sayood ◽  
Amanda M. Bartling ◽  
Jennifer R. Meyer ◽  
Clarise Starr ◽  
...  

ABSTRACT The highly infectious and zoonotic pathogen Francisella tularensis is the etiologic agent of tularemia, a potentially fatal disease if untreated. Despite the high average nucleotide identity, which is >99.2% for the virulent subspecies and >98% for all four subspecies, including the opportunistic microbe Francisella tularensis subsp. novicida, there are considerable differences in genetic organization. These chromosomal disparities contribute to the substantial differences in virulence observed between the various F. tularensis subspecies and subtypes. The methods currently available to genotype F. tularensis cannot conclusively identify the associated subpopulation without using time-consuming testing or complex scoring matrices. To address this need, we developed both single and multiplex quantitative real-time PCR (qPCR) assays that can accurately detect and identify the hypervirulent F. tularensis subsp. tularensis subtype A.I, the virulent F. tularensis subsp. tularensis subtype A.II, F. tularensis subsp. holarctica (also referred to as type B), and F. tularensis subsp. mediasiatica, as well as opportunistic F. tularensis subsp. novicida from each other and near neighbors, such as Francisella philomiragia, Francisella persica, and Francisella-like endosymbionts found in ticks. These fluorescence-based singleplex and non-matrix scoring multiplex qPCR assays utilize a hydrolysis probe, providing sensitive and specific F. tularensis subspecies and subtype identification in a rapid manner. Furthermore, sequencing of the amplified F. tularensis targets provides clade confirmation and informative strain-specific details. Application of these qPCR- and sequencing-based detection assays will provide an improved capability for molecular typing and clinical diagnostics, as well as facilitate the accurate identification and differentiation of F. tularensis subpopulations during epidemiological investigations of tularemia source outbreaks.


2020 ◽  
Vol 36 (8) ◽  
pp. 2401-2409 ◽  
Author(s):  
Nils Strodthoff ◽  
Patrick Wagner ◽  
Markus Wenzel ◽  
Wojciech Samek

Motivation: Inferring the properties of a protein from its amino acid sequence is one of the key problems in bioinformatics. Most state-of-the-art approaches for protein classification are tailored to single classification tasks and rely on handcrafted features, such as position-specific scoring matrices from expensive database searches. We argue that this level of performance can be reached or even be surpassed by learning a task-agnostic representation once, using self-supervised language modeling, and transferring it to specific tasks by a simple fine-tuning step. Results: We put forward a universal deep sequence model that is pre-trained on unlabeled protein sequences from Swiss-Prot and fine-tuned on protein classification tasks. We apply it to three prototypical tasks, namely enzyme class prediction, gene ontology prediction and remote homology and fold detection. The proposed method performs on par with state-of-the-art algorithms that were tailored to these specific tasks or, for two out of three tasks, even outperforms them. These results stress the possibility of inferring protein properties from the sequence alone and, on more general grounds, the prospects of modern natural language processing methods in omics. Moreover, we illustrate the prospects for explainable machine learning methods in this field by selected case studies. Availability and implementation: Source code is available under https://github.com/nstrodt/UDSMProt. Supplementary information: Supplementary data are available at Bioinformatics online.

