ASAP: A Machine-Learning Framework for Local Protein Properties

Determining residue-level protein properties, such as the sites of post-translational modifications (PTMs), is vital to understanding proteins at all levels of function. Experimental methods are costly and time-consuming, so high-confidence predictions are essential for functional knowledge at a genomic scale. Traditional computational methods based on strict rules (e.g. regular expressions) fail to annotate sites that lack substantial similarity. Machine learning (ML) methods are therefore fundamental to annotating proteins of unknown function. We present ASAP (Amino-acid Sequence Annotation Prediction), a universal ML framework for residue-level predictions. ASAP quickly and efficiently extracts a large set of window-based features from raw sequences. The platform also supports easy integration of external features such as secondary structure or PSSM profiles. The features are then combined to train underlying ML classifiers. We present a detailed case study in which ASAP was used to train CleavePred, a state-of-the-art predictor of protein precursor cleavage sites. Protein cleavage is a fundamental PTM shared by a wide variety of protein groups with minimal sequence similarity. Current computational methods have high false positive rates, making them suboptimal for this task. CleavePred has a simple Python API and is freely accessible via a web-based application. The high performance of ASAP on the task of precursor cleavage makes it suited for analyzing new proteomes at a genomic scale. The tool is attractive for protein design, mass spectrometry search engines and the discovery of new peptide hormones. In summary, we illustrate ASAP as an entry point for predicting PTMs. The approach and flexibility of the platform can easily be extended to additional residue-specific tasks. ASAP and CleavePred source code is available at https://github.com/ddofer/asap.
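
The window-based feature extraction mentioned in this abstract can be illustrated with a minimal sketch. It is not ASAP's actual API: the function name `window_features`, the padding symbol, and the one-hot encoding are assumptions chosen for illustration, and the abstract does not say which window features ASAP computes.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD = "_"  # hypothetical padding symbol for positions beyond the sequence ends


def window_features(sequence, half_width=2):
    """Return one feature vector per residue, built from its sequence window.

    Each position in the window of length 2 * half_width + 1 is one-hot
    encoded over the 20 standard amino acids plus the padding symbol.
    """
    alphabet = AMINO_ACIDS + PAD
    padded = PAD * half_width + sequence + PAD * half_width
    features = []
    for i in range(len(sequence)):
        window = padded[i : i + 2 * half_width + 1]
        vector = []
        for residue in window:
            # Symbols outside the alphabet (e.g. X) are encoded as all zeros.
            vector.extend(1 if residue == symbol else 0 for symbol in alphabet)
        features.append(vector)
    return features
```

Each residue then contributes one row to a feature matrix, and a per-residue label (for example, cleavage site or not) can be paired with that row to train a classifier.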

Adsorption Isotherm Predictions for Multiple Molecules in MOFs Using the Same Deep Learning Model

10.26434/chemrxiv.9894224.v1 ◽

2019 ◽

Author(s):

Ryther Anderson ◽

Achay Biong ◽

Diego Gómez-Gualdrón

Keyword(s):

Neural Network ◽

Machine Learning ◽

Molecular Simulation ◽

Large Scale ◽

Learning Model ◽

Operating Conditions ◽

Small Subset ◽

Screening Methods ◽

Large Set ◽

Metal Organic

Tailoring the structure and chemistry of metal-organic frameworks (MOFs) enables the manipulation of their adsorption properties to suit specific energy and environmental applications. As there are millions of possible MOFs (with tens of thousands already synthesized), molecular simulation, such as grand canonical Monte Carlo (GCMC), has frequently been used to rapidly evaluate the adsorption performance of a large set of MOFs. This allows subsequent experiments to focus only on a small subset of the most promising MOFs. In many instances, however, even molecular simulation becomes prohibitively time-consuming, underscoring the need for alternative screening methods, such as machine learning, to precede molecular simulation efforts. In this study, as a proof of concept, we trained a neural network as the first example of a machine learning model capable of predicting full adsorption isotherms of different molecules not included in the training of the model. To achieve this, we trained our neural network only on alchemical species, represented only by their geometry and force field parameters, and used this neural network to predict the loadings of real adsorbates. We focused on predicting room temperature adsorption of small (one- and two-atom) molecules relevant to chemical separations, namely argon, krypton, xenon, methane, ethane, and nitrogen. However, we also observed surprisingly promising predictions for more complex molecules, whose properties are outside the range spanned by the alchemical adsorbates. Prediction accuracies suitable for large-scale screening were achieved using simple MOF descriptors (e.g. geometric properties and chemical moieties) and adsorbate descriptors (e.g. force field parameters and geometry). Our results illustrate a new philosophy of training that opens the path towards development of machine learning models that can predict the adsorption loading of any new adsorbate at any new operating conditions in any new MOF.

A Brief Survey of Machine Learning Methods in Identification of Mitochondria Proteins in Malaria Parasite

Current Pharmaceutical Design ◽

10.2174/1381612826666200310122324 ◽

2020 ◽

Vol 26 (26) ◽

pp. 3049-3058

Author(s):

Ting Liu ◽

Hua Tang

Keyword(s):

Machine Learning ◽

Computational Methods ◽

Future Development ◽

Malaria Parasite ◽

Mitochondrial Proteins ◽

Learning Methods ◽

Machine Learning Methods ◽

Effective Drugs ◽

Construction Strategies ◽

Day By Day

The number of human deaths caused by malaria is increasing day by day. The mitochondrial proteins of the malaria parasite play vital roles in the organism. To develop effective drugs and vaccines against infection, it is necessary to accurately identify the mitochondrial proteins of the malaria parasite. Although biochemical experiments can provide precise details of these proteins, they are expensive and time-consuming. In this review, we summarize the machine learning-based methods for identifying mitochondrial proteins in the malaria parasite and compare the construction strategies of these computational methods. Finally, we discuss the future development of algorithms for mitochondrial protein recognition.

Microstructures in Elastic Media

10.1093/oso/9780195090864.001.0001 ◽

1994 ◽

Cited By ~ 1

Author(s):

Nhan Phan-Thien ◽

Sangtae Kim

Keyword(s):

Computational Methods ◽

High Performance ◽

Chemical Engineering ◽

Elliptic Partial Differential Equations ◽

Parallel Computers ◽

Computational Algorithms ◽

Applied Physics ◽

Advanced Students ◽

Particulate Solids ◽

Modeling Software

This monograph describes various methods for solving deformation problems of particulate solids, taking the reader from analytical to computational methods. The book is the first to present the topic of linear elasticity in mathematical terms that will be familiar to anyone with a grounding in fluid mechanics. It incorporates the latest advances in computational algorithms for elliptic partial differential equations, and provides the groundwork for simulations on high performance parallel computers. Numerous exercises complement the theoretical discussions, and a related set of self-documented programs is available to readers with Internet access. The work will be of interest to advanced students and practicing researchers in mechanical engineering, chemical engineering, applied physics, computational methods, and developers of numerical modeling software.

Statistical and machine learning models for optimizing energy in parallel applications

The International Journal of High Performance Computing Applications ◽

10.1177/1094342019842915 ◽

2019 ◽

Vol 33 (6) ◽

pp. 1079-1097 ◽

Cited By ~ 2

Author(s):

Mark Endrei ◽

Chao Jin ◽

Minh Ngoc Dinh ◽

David Abramson ◽

Heidi Poxon ◽

...

Keyword(s):

Machine Learning ◽

Energy Efficiency ◽

High Performance ◽

Large Scale ◽

Energy Use ◽

Parallel Applications ◽

Learning Models ◽

Trade Off ◽

Time Required ◽

Machine Learning Models

Rising power costs and constraints are driving a growing focus on the energy efficiency of high performance computing systems. The unique characteristics of a particular system and workload and their effect on performance and energy efficiency are typically difficult for application users to assess and to control. Settings for optimum performance and energy efficiency can also diverge, so we need to identify trade-off options that guide a suitable balance between energy use and performance. We present statistical and machine learning models that only require a small number of runs to make accurate Pareto-optimal trade-off predictions using parameters that users can control. We study model training and validation using several parallel kernels and more complex workloads, including Algebraic Multigrid (AMG), Large-scale Atomic Molecular Massively Parallel Simulator, and Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics. We demonstrate that we can train the models using as few as 12 runs, with prediction error of less than 10%. Our AMG results identify trade-off options that provide up to 45% improvement in energy efficiency for around 10% performance loss. We reduce the sample measurement time required for AMG by 90%, from 13 h to 74 min.
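
The notion of Pareto-optimal trade-offs between performance and energy use can be sketched as follows. This is an illustrative filter only, not the authors' statistical or machine learning models; the function name `pareto_front` and the configuration tuples are hypothetical, and both runtime and energy are assumed to be quantities where lower is better.

```python
def pareto_front(configs):
    """Return configurations not dominated in (runtime, energy).

    Each configuration is a (name, runtime, energy) tuple. A configuration is
    dominated if another one is no worse in both measures and strictly better
    in at least one.
    """
    front = []
    for name, runtime, energy in configs:
        dominated = any(
            other_rt <= runtime
            and other_en <= energy
            and (other_rt < runtime or other_en < energy)
            for _, other_rt, other_en in configs
        )
        if not dominated:
            front.append((name, runtime, energy))
    # Sort by runtime so the trade-off curve reads from fastest to slowest.
    return sorted(front, key=lambda c: c[1])
```

Given measured or predicted runtime and energy for each parameter setting, the returned list contains only the settings worth offering to a user as trade-off options.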

Use of Machine Learning to Investigate the Quantitative Checklist for Autism in Toddlers (Q-CHAT) towards Early Autism Screening

Diagnostics ◽

10.3390/diagnostics11030574 ◽

2021 ◽

Vol 11 (3) ◽

pp. 574

Author(s):

Gennaro Tartarisco ◽

Giovanni Cicceri ◽

Davide Di Pietro ◽

Elisa Leonardi ◽

Stefania Aiello ◽

...

Keyword(s):

Machine Learning ◽

High Performance ◽

Behavioral Science ◽

Autistic Traits ◽

Classification Performance ◽

Recursive Feature Elimination ◽

Diagnostic Tools ◽

Support Vector ◽

K Nearest Neighbors ◽

Autism Screening

In the past two decades, several screening instruments were developed to detect toddlers who may be autistic both in clinical and unselected samples. Among others, the Quantitative CHecklist for Autism in Toddlers (Q-CHAT) is a quantitative and normally distributed measure of autistic traits that demonstrates good psychometric properties in different settings and cultures. Recently, machine learning (ML) has been applied to behavioral science to improve the classification performance of autism screening and diagnostic tools, but mainly in children, adolescents, and adults. In this study, we used ML to investigate the accuracy and reliability of the Q-CHAT in discriminating young autistic children from those without. Five different ML algorithms (random forest (RF), naïve Bayes (NB), support vector machine (SVM), logistic regression (LR), and K-nearest neighbors (KNN)) were applied to investigate the complete set of Q-CHAT items. Our results showed that ML achieved an overall accuracy of 90%, and the SVM was the most effective, being able to classify autism with 95% accuracy. Furthermore, using the SVM–recursive feature elimination (RFE) approach, we selected a subset of 14 items ensuring 91% accuracy, while 83% accuracy was obtained from the 3 best discriminating items in common to ours and the previously reported Q-CHAT-10. This evidence confirms the high performance and cross-cultural validity of the Q-CHAT, and supports the application of ML to create shorter and faster versions of the instrument, maintaining high classification accuracy, to be used as a quick, easy, and high-performance tool in primary-care settings.

Automatic generation of high-performance quantized machine learning kernels

Proceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization ◽

10.1145/3368826.3377912 ◽

2020 ◽

Cited By ~ 1

Author(s):

Meghan Cowan ◽

Thierry Moreau ◽

Tianqi Chen ◽

James Bornholt ◽

Luis Ceze

Keyword(s):

Machine Learning ◽

High Performance ◽

Automatic Generation

An IoT-Focused Intrusion Detection System Approach Based on Preprocessing Characterization for Cybersecurity Datasets

Sensors ◽

10.3390/s21020656 ◽

2021 ◽

Vol 21 (2) ◽

pp. 656

Author(s):

Xavier Larriva-Novo ◽

Víctor A. Villagrá ◽

Mario Vega-Barbas ◽

Diego Rivera ◽

Mario Sanz Rodrigo

Keyword(s):

Machine Learning ◽

Intrusion Detection ◽

High Performance ◽

Learning Algorithm ◽

Detection System ◽

Machine Learning Algorithms ◽

Statistical Characteristics ◽

Detection Techniques ◽

Traffic Characteristics ◽

Benchmark Datasets

Security in IoT networks is currently mandatory, due to the large amount of data that has to be handled. These systems are vulnerable to several cybersecurity attacks, which are increasing in number and sophistication. For this reason, new intrusion detection techniques that are as accurate as possible in these scenarios have to be developed. Intrusion detection systems based on machine learning algorithms have already shown high performance in terms of accuracy. This research proposes the study and evaluation of several preprocessing techniques based on traffic categorization for a machine learning neural network algorithm. The evaluation uses two benchmark datasets, UGR16 and UNSW-NB15, and one of the most widely used datasets, KDD99. The preprocessing techniques were evaluated in accordance with scalar and normalization functions. All of these preprocessing models were applied to different sets of characteristics based on a categorization composed of four groups of features: basic connection features, content characteristics, statistical characteristics and, finally, a group composed of traffic-based features and connection direction-based traffic characteristics. The objective of this research is to evaluate this categorization by using various data preprocessing techniques to obtain the most accurate model. Our proposal shows that, by applying the categorization of network traffic and several preprocessing techniques, the accuracy can be enhanced by up to 45%. Preprocessing a specific group of characteristics yields greater accuracy, allowing the machine learning algorithm to correctly classify the parameters related to possible attacks.
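
The abstract does not specify which scaling and normalization functions were evaluated. As an assumption for illustration, the sketch below shows two common choices applied to one numeric feature column, min-max scaling and z-score standardization; the function names are hypothetical.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Shift values to zero mean and scale to unit population variance."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]
```

Applying a different function to each feature group (for example, connection features versus statistical features) is one way the group-wise preprocessing described above could be set up.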

Predicting bacteriophage hosts based on sequences of annotated receptor-binding proteins

Scientific Reports ◽

10.1038/s41598-021-81063-4 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Dimitri Boeckaerts ◽

Michiel Stock ◽

Bjorn Criel ◽

Hans Gerstmans ◽

Bernard De Baets ◽

...

Keyword(s):

Machine Learning ◽

Predictive Model ◽

Receptor Binding ◽

Bacterial Infections ◽

Sequence Data ◽

Sequence Similarity ◽

Area Under The Curve ◽

Local Alignment ◽

Search Tool ◽

Different Levels

Nowadays, bacteriophages are increasingly considered as an alternative treatment for a variety of bacterial infections in cases where classical antibiotics have become ineffective. However, characterizing the host specificity of phages remains a labor- and time-intensive process. In order to alleviate this burden, we have developed a new machine-learning-based pipeline to predict bacteriophage hosts based on annotated receptor-binding protein (RBP) sequence data. We focus on predicting bacterial hosts from the ESKAPE group, Escherichia coli, Salmonella enterica and Clostridium difficile. We compare the performance of our predictive model with that of the widely used Basic Local Alignment Search Tool (BLAST). Our best-performing predictive model reaches Precision-Recall Area Under the Curve (PR-AUC) scores between 73.6 and 93.8% for different levels of sequence similarity in the collected data. Our model reaches a performance comparable to that of BLASTp when sequence similarity in the data is high and starts outperforming BLASTp when sequence similarity drops below 75%. Therefore, our machine learning methods can be especially useful in settings in which sequence similarity to other known sequences is low. Predicting the hosts of novel metagenomic RBP sequences could extend our toolbox to tune the host spectrum of phages or phage tail-like bacteriocins by swapping RBPs.
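
For readers unfamiliar with the PR-AUC metric, the sketch below computes average precision, one common summary of the area under the precision-recall curve. Whether the authors used this estimator or another (for example, trapezoidal integration) is not stated in the abstract; the function name and the binary-label setup are assumptions, and tied scores are not given special treatment.

```python
def average_precision(labels, scores):
    """Average precision for binary labels (1 = positive) and real-valued scores."""
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("at least one positive label is required")
    # Rank predictions from the highest to the lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_positives = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            # Precision at the rank where this positive is recovered.
            precision_sum += true_positives / rank
    return precision_sum / total_positives
```

For a multi-class host prediction task, a score of this kind would typically be computed per host class and then averaged.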
