Assessing the behavior of machine learning methods to predict the activity of antimicrobial peptides

This study demonstrates the importance of obtaining statistically stable results when using machine learning methods to predict the activity of antimicrobial peptides, where the cost and complexity of the chemical processes involved keep datasets particularly small (fewer than a few hundred instances). As in other fields with similar problems, this results in large variability in the performance of predictive models, hindering any attempt to transfer them to lab practice. Rather than targeting good peak performance obtained from very particular experimental setups, as reported in related literature, we focused on characterizing the behavior of the machine learning methods, as a preliminary step to obtain reproducible results across experimental setups and, ultimately, good performance. We propose a methodology that integrates feature learning (autoencoders) and selection methods (genetic algorithms) through the exhaustive use of performance metrics (permutation tests and bootstrapping), which provide stronger statistical evidence to support investment decisions with the lab resources at hand. We show evidence for the usefulness of 1) the extensive use of computational resources, and 2) adopting a wider range of metrics than those reported in the literature to assess method performance. This approach allowed us to guide our search for suitable machine learning methods, and to obtain results comparable to those in the literature with strong statistical stability.
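The bootstrapping and permutation tests mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the choice of accuracy as the metric, and the default resample counts are all assumptions made for illustration.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def bootstrap_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for accuracy (hypothetical helper)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        # Resample (label, prediction) pairs with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(accuracy([y_true[i] for i in idx],
                              [y_pred[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


def permutation_p_value(y_true, y_pred, n_perm=2000, seed=0):
    """Share of label shufflings that score at least as well as the real labels."""
    rng = random.Random(seed)
    observed = accuracy(y_true, y_pred)
    shuffled = list(y_true)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if accuracy(shuffled, y_pred) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)
```

On a small dataset, a wide bootstrap interval or a large permutation p-value signals that a reported score may not be reproducible, which is the kind of instability the abstract warns about.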

Machine Learning Methods Applied to the Prediction of Pseudo-nitzschia spp. Blooms in the Galician Rias Baixas (NW Spain)

ISPRS International Journal of Geo-Information ◽

10.3390/ijgi10040199 ◽

2021 ◽

Vol 10 (4) ◽

pp. 199

Author(s):

Francisco M. Bellas Aláez ◽

Jesus M. Torres Palenzuela ◽

Evangelos Spyrakos ◽

Luis González Vilas

Keyword(s):

Machine Learning ◽

Performance Metrics ◽

Prediction Models ◽

Support Vector ◽

False Alarms ◽

Learning Approaches ◽

Learning Methods ◽

Machine Learning Methods ◽

Rías Baixas ◽

New Algorithms

This work presents new prediction models based on recent developments in machine learning methods, such as Random Forest (RF) and AdaBoost, and compares them with more classical approaches, i.e., support vector machines (SVMs) and neural networks (NNs). The models predict Pseudo-nitzschia spp. blooms in the Galician Rias Baixas. This work builds on a previous study by the authors (doi.org/10.1016/j.pocean.2014.03.003) but uses an extended database (from 2002 to 2012) and new algorithms. Our results show that RF and AdaBoost provide better prediction results compared to SVMs and NNs, as they show improved performance metrics and a better balance between sensitivity and specificity. Classical machine learning approaches show higher sensitivities, but at a cost of lower specificity and higher percentages of false alarms (lower precision). These results seem to indicate a greater adaptation of new algorithms (RF and AdaBoost) to unbalanced datasets. Our models could be operationally implemented to establish a short-term prediction system.
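The trade-off described above between sensitivity, specificity, and false alarms can be made concrete with a small sketch. The function name and the dictionary keys are hypothetical, and the false alarm ratio is taken here as 1 minus precision, following the abstract's pairing of "false alarms" with "lower precision".

```python
def binary_metrics(y_true, y_pred):
    """Confusion-matrix metrics for binary labels (1 = bloom, 0 = no bloom)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def ratio(num, den):
        # Undefined ratios are reported as NaN rather than raising.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),        # share of blooms detected
        "specificity": ratio(tn, tn + fp),        # share of non-blooms recognised
        "precision": ratio(tp, tp + fp),          # share of alarms that were real
        "false_alarm_ratio": ratio(fp, tp + fp),  # 1 - precision
    }
```

On an unbalanced dataset a model can reach high sensitivity simply by raising many alarms, so reporting all four values together shows whether that gain came at the cost of specificity and precision.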

Multi-label pathway prediction based on active dataset subsampling

10.1101/2020.09.14.297424 ◽

2020 ◽

Author(s):

Abdur Rahman M. A. Basher ◽

Steven J. Hallam

Keyword(s):

Machine Learning ◽

Microbial Communities ◽

Performance Metrics ◽

Class Imbalance ◽

Training Data ◽

Great Promise ◽

Biological Organization ◽

Learning Methods ◽

Machine Learning Methods ◽

Pathway Prediction

Abstract Machine learning methods show great promise in predicting metabolic pathways at different levels of biological organization. However, several complications remain that can degrade prediction performance, including inadequately labeled training data, missing feature information, and inherent imbalances in the distribution of enzymes and pathways within a dataset. This class imbalance problem is commonly encountered by the machine learning community when the proportion of instances over class labels within a dataset is uneven, resulting in poor predictive performance for underrepresented classes. Here, we present leADS, multi-label learning based on active dataset subsampling, which leverages the idea of subsampling points from a pool of data to reduce the negative impact of training loss due to class imbalance. Specifically, leADS performs an iterative process to: (i) construct an acquisition model in an ensemble framework; (ii) select informative points using an appropriate acquisition function; and (iii) train on selected samples. Multiple base learners are implemented in parallel, where each is assigned a portion of labeled training data to learn pathways. We benchmark leADS using a corpus of 10 experimental datasets manifesting diverse multi-label properties used in previous pathway prediction studies, including manually curated organismal genomes, synthetic microbial communities and low complexity microbial communities. Resulting performance metrics equaled or exceeded previously reported machine learning methods for both organismal and multi-organismal genomes while establishing an extensible framework for navigating class imbalances across diverse real world datasets. Availability and implementation: The software package and installation instructions are published on github.com/[email protected]

Performance Evaluation of Machine Learning Methods for Forest Fire Modeling and Prediction

Symmetry ◽

10.3390/sym12061022 ◽

2020 ◽

Vol 12 (6) ◽

pp. 1022 ◽

Cited By ~ 11

Author(s):

Binh Thai Pham ◽

Abolfazl Jaafari ◽

Mohammadtaghi Avand ◽

Nadhir Al-Ansari ◽

Tran Dinh Du ◽

...

Keyword(s):

Machine Learning ◽

National Park ◽

Performance Metrics ◽

Characteristic Curve ◽

Residential Areas ◽

Learning Methods ◽

Explanatory Variables ◽

Machine Learning Methods ◽

Modeling Methodology ◽

In Fire

Predicting and mapping fire susceptibility is a top research priority in fire-prone forests worldwide. This study evaluates the abilities of the Bayes Network (BN), Naïve Bayes (NB), Decision Tree (DT), and Multivariate Logistic Regression (MLR) machine learning methods for predicting and mapping fire susceptibility across the Pu Mat National Park, Nghe An Province, Vietnam. The modeling methodology was formulated based on processing the information from 57 historical fires and a set of nine spatially explicit explanatory variables, namely elevation, slope degree, aspect, average annual temperature, drought index, river density, land cover, and distance from roads and residential areas. Using the area under the receiver operating characteristic curve (AUC) and seven other performance metrics, the models were validated in terms of their abilities to elucidate the general fire behaviors in the Pu Mat National Park and to predict future fires. Despite a few differences between the AUC values, the BN model with an AUC value of 0.96 was dominant over the other models in predicting future fires. The second best was the DT model (AUC = 0.94), followed by the NB (AUC = 0.939) and MLR (AUC = 0.937) models. Our robustness analysis demonstrated that these models are sufficiently robust to changes in the training and validation datasets. Further, the results revealed that moderate to high levels of fire susceptibility are associated with ~19% of the Pu Mat National Park, where human activities are numerous. This study and the resultant susceptibility maps provide a basis for developing more efficient fire-fighting strategies and reorganizing policies in favor of sustainable management of forest resources.
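For readers unfamiliar with the AUC values quoted above, the following sketch computes the area under the ROC curve through its rank interpretation: the probability that a randomly chosen positive (fire) location receives a higher score than a randomly chosen negative one. The function name and toy inputs are hypothetical; this is not the study's evaluation code.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so the gaps between 0.96, 0.94, 0.939, and 0.937 reported above are small differences in ranking quality.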

An Overview of Opportunities for Machine Learning Methods in Underground Rock Engineering Design

Geosciences ◽

10.3390/geosciences9120504 ◽

2019 ◽

Vol 9 (12) ◽

pp. 504

Author(s):

Josephine Morgenroth ◽

Usman T. Khan ◽

Matthew A. Perras

Keyword(s):

Machine Learning ◽

Engineering Design ◽

Performance Metrics ◽

Mining Industry ◽

Learning Methods ◽

Rock Engineering ◽

Input Selection ◽

Machine Learning Methods ◽

And Performance ◽

Rock Engineering Design

Machine learning methods for data processing are gaining momentum in many geoscience industries. This includes the mining industry, where machine learning is primarily being applied to autonomously driven vehicles such as haul trucks, and to ore body and resource delineation. However, machine learning applications have appeared in the rock engineering literature only relatively recently, despite being widely used and generally accepted for decades in other risk assessment-type design areas, such as flood forecasting. Operating mines and underground infrastructure projects collect more instrumentation data than ever before; however, only a small fraction of the useful information is typically extracted for rock engineering design, and there is often insufficient time to investigate complex rock mass phenomena in detail. This paper presents a summary of current practice in rock engineering design, as well as a review of literature and methods at the intersection of machine learning and rock engineering. It identifies gaps, such as standards for architecture, input selection and performance metrics, and areas for future work. These gaps present an opportunity to define a framework for integrating machine learning into conventional rock engineering design methodologies to make them more rigorous and reliable in predicting probable underlying physical mechanics and phenomena.

Modeling Traders’ Behavior with Deep Learning and Machine Learning Methods: Evidence from BIST 100 Index

Complexity ◽

10.1155/2020/8285149 ◽

2020 ◽

Vol 2020 ◽

pp. 1-16

Author(s):

Afan Hasan ◽

Oya Kalıpsız ◽

Selim Akyokuş

Keyword(s):

Machine Learning ◽

Deep Learning ◽

Financial Market ◽

Performance Metrics ◽

Confusion Matrix ◽

Support Vector ◽

Human Beings ◽

Learning Methods ◽

Technical Indicators ◽

Machine Learning Methods

Although the vast majority of fundamental analysts believe that technical analysts’ estimates and the technical indicators used in these analyses are unresponsive, recent research has revealed that both professionals and individual traders use technical indicators. Correctly estimating the direction of the financial market is very challenging, primarily due to the nonlinear nature of financial time series. Deep learning and machine learning methods, on the other hand, have achieved very successful results in many areas where human beings are challenged. In this study, technical indicators were integrated into deep learning and machine learning methods, and the behavior of traders was modeled in order to increase the accuracy of forecasting the financial market direction. A set of technical indicators has been examined, based on their application in technical analysis, as input features to predict the oncoming (one-period-ahead) direction of the Istanbul Stock Exchange (BIST100) national index. To predict the direction of the index, Deep Neural Network (DNN), Support Vector Machine (SVM), Random Forest (RF), and Logistic Regression (LR) classification techniques are used. The performance of these models is evaluated using various performance metrics such as the confusion matrix, compound return, and max drawdown.

Substrate specificity of 2-deoxy-D-ribose 5-phosphate aldolase (DERA) assessed by different protein engineering and machine learning methods

Applied Microbiology and Biotechnology ◽

10.1007/s00253-020-10960-x ◽

2020 ◽

Vol 104 (24) ◽

pp. 10515-10529

Author(s):

Sanni Voutilainen ◽

Markus Heinonen ◽

Martina Andberg ◽

Emmi Jokinen ◽

Hannu Maaheimo ◽

...

Keyword(s):

Machine Learning ◽

Amino Acid ◽

Protein Engineering ◽

Specific Activity ◽

Complex Structure ◽

Feature Learning ◽

Wild Type ◽

Learning Methods ◽

Clear Cut ◽

Machine Learning Methods

Abstract In this work, deoxyribose-5-phosphate aldolase (Ec DERA, EC 4.1.2.4) from Escherichia coli was chosen as the protein engineering target for improving the substrate preference towards smaller, non-phosphorylated aldehyde donor substrates, in particular towards acetaldehyde. The initial broad set of mutations was directed to 24 amino acid positions in the active site or in the close vicinity, based on the 3D complex structure of the E. coli DERA wild-type aldolase. The specific activity of the DERA variants containing one to three amino acid mutations was characterised using three different substrates. A novel machine learning (ML) model utilising Gaussian processes and feature learning was applied for the 3rd mutagenesis round to predict new beneficial mutant combinations. This led to the most clear-cut (two- to threefold) improvement in acetaldehyde (C2) addition capability with the concomitant abolishment of the activity towards the natural donor molecule glyceraldehyde-3-phosphate (C3P) as well as the non-phosphorylated equivalent (C3). The Ec DERA variants were also tested on aldol reaction utilising formaldehyde (C1) as the donor. Ec DERA wild-type was shown to be able to carry out this reaction, and furthermore, some of the improved variants on acetaldehyde addition reaction turned out to have also improved activity on formaldehyde. Key points • DERA aldolases are promiscuous enzymes. • Synthetic utility of DERA aldolase was improved by protein engineering approaches. • Machine learning methods aid the protein engineering of DERA.

Comprehensive assessment of machine learning-based methods for predicting antimicrobial peptides

Briefings in Bioinformatics ◽

10.1093/bib/bbab083 ◽

2021 ◽

Author(s):

Jing Xu ◽

Fuyi Li ◽

André Leier ◽

Dongxu Xiang ◽

Hsin-Hui Shen ◽

...

Keyword(s):

Machine Learning ◽

Antimicrobial Peptides ◽

Computational Methods ◽

Cross Validation ◽

Predictive Performance ◽

Support Vector ◽

Data Sets ◽

Learning Methods ◽

Data Set ◽

Machine Learning Methods

Abstract Antimicrobial peptides (AMPs) are a unique and diverse group of molecules that play a crucial role in a myriad of biological processes and cellular functions. AMP-related studies have become increasingly popular in recent years due to antimicrobial resistance, which is becoming an emerging global concern. Systematic experimental identification of AMPs faces many difficulties due to the limitations of current methods. Given its significance, more than 30 computational methods have been developed for accurate prediction of AMPs. These approaches show high diversity in their data set size, data quality, core algorithms, feature extraction, feature selection techniques and evaluation strategies. Here, we provide a comprehensive survey on a variety of current approaches for AMP identification and point at the differences between these methods. In addition, we evaluate the predictive performance of the surveyed tools based on an independent test data set containing 1536 AMPs and 1536 non-AMPs. Furthermore, we construct six validation data sets based on six different common AMP databases and compare different computational methods based on these data sets. The results indicate that amPEPpy achieves the best predictive performance and outperforms the other compared methods. As the predictive performances are affected by the different data sets used by different methods, we additionally perform the 5-fold cross-validation test to benchmark different traditional machine learning methods on the same data set. These cross-validation results indicate that random forest, support vector machine and eXtreme Gradient Boosting achieve comparatively better performances than other machine learning methods and are often the algorithms of choice of multiple AMP prediction tools.

Study on Evaluation of Machine Learning Approaches in Brain Tumour MR Images

Turkish Journal of Computer and Mathematics Education (TURCOMAT) ◽

10.17762/turcomat.v12i5.2028 ◽

2021 ◽

Vol 12 (5) ◽

pp. 1361-1371

Author(s):

Nisha Joseph et al.

Keyword(s):

Neural Network ◽

Machine Learning ◽

Brain Tumour ◽

Performance Metrics ◽

Dice Similarity Coefficient ◽

Learning Methods ◽

Machine Learning Methods ◽

Supervised Segmentation ◽

Segmentation Methods ◽

Segmentation Approach

The principal aim of this work is to compare the performance of supervised brain tumour segmentation methods based on machine learning. First, the input MR brain image is denoised with an adaptive bilateral filter, and the image contrast is enhanced with histogram equalization. Features are then extracted from the pre-processed image. Among several feature extraction methods, this work uses shape, intensity, and texture feature extractors. After these three types of features have been extracted, the tumour is segmented based on them using a supervised segmentation approach. Among several supervised segmentation methods, this work uses three machine learning methods, namely Probabilistic Neural Network (PNN), Artificial Neural Network (ANN), and Convolution Neural Network (CNN). Finally, the extracted features are fed into these machine learning methods to segment the brain tumour regions. To identify the best approach, the performance of these three supervised machine learning methods is evaluated with four performance metrics: Dice Similarity Coefficient (DSC), Positive Predictive Value (PPV), Jaccard Index (JI), and Sensitivity (SEN). The experimental results showed that the CNN outperforms the other methods.
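The four segmentation metrics named above (DSC, PPV, JI, SEN) can be computed from the overlap between a predicted mask and a ground-truth mask. The sketch below assumes both masks are flat sequences of 0/1 pixel labels of equal length; the function name and input layout are hypothetical and not taken from the paper.

```python
def segmentation_metrics(pred_mask, true_mask):
    """Overlap metrics for two flat binary masks (1 = tumour pixel)."""
    if len(pred_mask) != len(true_mask):
        raise ValueError("masks must have the same number of pixels")
    tp = sum(1 for p, t in zip(pred_mask, true_mask) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred_mask, true_mask) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred_mask, true_mask) if p == 0 and t == 1)

    def ratio(num, den):
        # Undefined ratios (e.g. two empty masks) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "DSC": ratio(2 * tp, 2 * tp + fp + fn),  # Dice similarity coefficient
        "PPV": ratio(tp, tp + fp),               # positive predictive value
        "JI": ratio(tp, tp + fp + fn),           # Jaccard index
        "SEN": ratio(tp, tp + fn),               # sensitivity
    }
```

DSC and JI are monotonically related (DSC = 2·JI / (1 + JI)), so they rank methods identically; PPV and SEN separate over-segmentation from under-segmentation.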

Deep learning classification of lipid droplets in quantitative phase images

PLoS ONE ◽

10.1371/journal.pone.0249196 ◽

2021 ◽

Vol 16 (4) ◽

pp. e0249196

Author(s):

Luke Sheneman ◽

Gregory Stephanopoulos ◽

Andreas E. Vasdekis

Keyword(s):

Machine Learning ◽

Lipid Droplets ◽

Supervised Machine Learning ◽

Label Free ◽

Learning Methods ◽

Method Performance ◽

Quantitative Phase ◽

Machine Learning Methods ◽

Phase Images

We report the application of supervised machine learning to the automated classification of lipid droplets in label-free, quantitative-phase images. By comparing various machine learning methods commonly used in biomedical imaging and remote sensing, we found convolutional neural networks to outperform others, both quantitatively and qualitatively. We describe our imaging approach, all implemented machine learning methods, and their performance with respect to computational efficiency, required training resources, and relative method performance measured across multiple metrics. Overall, our results indicate that quantitative-phase imaging coupled to machine learning enables accurate lipid droplet classification in single living cells. As such, the present paradigm offers an excellent alternative to the more common fluorescent and Raman imaging modalities by enabling label-free imaging with ultra-low phototoxicity and deeper insight into the thermodynamics of metabolism of single cells.

Advanced Interpretable Machine Learning Methods for Clinical NGS Big Data of Complex Hereditary Diseases

10.3389/978-2-88966-274-6 ◽

2020 ◽

Keyword(s):

Machine Learning ◽

Big Data ◽

Hereditary Diseases ◽

Learning Methods ◽

Machine Learning Methods ◽

Interpretable Machine Learning
