scholarly journals Assessing the behavior of machine learning methods to predict the activity of antimicrobial peptides

2016 ◽  
Vol 26 (44) ◽  
pp. 167
Author(s):  
Francy Liliana Camacho ◽  
Rodrigo Torres-Sáez ◽  
Raúl Ramos-Pollán

This study demonstrates the importance of obtaining statistically stable results when using machine learning methods to predict the activity of antimicrobial peptides, due to the cost and complexity of the chemical processes involved in cases where datasets are particularly small (less than a few hundred instances). Like in other fields with similar problems, this results in large variability in the performance of predictive models, hindering any attempt to transfer them to lab practice. Rather than targeting good peak performance obtained from very particular experimental setups, as reported in related literature, we focused on characterizing the behavior of the machine learning methods, as a preliminary step to obtain reproducible results across experimental setups, and, ultimately, good performance. We propose a methodology that integrates feature learning (autoencoders) and selection methods (genetic algorithms) thorough the exhaustive use of performance metrics (permutation tests and bootstrapping), which provide stronger statistical evidence to support investment decisions with the lab resources at hand. We show evidence for the usefulness of 1) the extensive use of computational resources, and 2) adopting a wider range of metrics than those reported in the literature to assess method performance. This approach allowed us to guide our quest for finding suitable machine learning methods, and to obtain results comparable to those in the literature with strong statistical stability.

2021 ◽  
Vol 10 (4) ◽  
pp. 199
Author(s):  
Francisco M. Bellas Aláez ◽  
Jesus M. Torres Palenzuela ◽  
Evangelos Spyrakos ◽  
Luis González Vilas

This work presents new prediction models based on recent developments in machine learning methods, such as Random Forest (RF) and AdaBoost, and compares them with more classical approaches, i.e., support vector machines (SVMs) and neural networks (NNs). The models predict Pseudo-nitzschia spp. blooms in the Galician Rias Baixas. This work builds on a previous study by the authors (doi.org/10.1016/j.pocean.2014.03.003) but uses an extended database (from 2002 to 2012) and new algorithms. Our results show that RF and AdaBoost provide better prediction results compared to SVMs and NNs, as they show improved performance metrics and a better balance between sensitivity and specificity. Classical machine learning approaches show higher sensitivities, but at a cost of lower specificity and higher percentages of false alarms (lower precision). These results seem to indicate a greater adaptation of new algorithms (RF and AdaBoost) to unbalanced datasets. Our models could be operationally implemented to establish a short-term prediction system.


2020 ◽  
Author(s):  
Abdur Rahman M. A. Basher ◽  
Steven J. Hallam

AbstractMachine learning methods show great promise in predicting metabolic pathways at different levels of biological organization. However, several complications remain that can degrade prediction performance including inadequately labeled training data, missing feature information, and inherent imbalances in the distribution of enzymes and pathways within a dataset. This class imbalance problem is commonly encountered by the machine learning community when the proportion of instances over class labels within a dataset are uneven, resulting in poor predictive performance for underrepresented classes. Here, we present leADS, multi-label learning based on active dataset subsampling, that leverages the idea of subsampling points from a pool of data to reduce the negative impact of training loss due to class imbalance. Specifically, leADS performs an iterative process to: (i)-construct an acquisition model in an ensemble framework; (ii) select informative points using an appropriate acquisition function; and (iii) train on selected samples. Multiple base learners are implemented in parallel where each is assigned a portion of labeled training data to learn pathways. We benchmark leADS using a corpora of 10 experimental datasets manifesting diverse multi-label properties used in previous pathway prediction studies, including manually curated organismal genomes, synthetic microbial communities and low complexity microbial communities. Resulting performance metrics equaled or exceeded previously reported machine learning methods for both organismal and multi-organismal genomes while establishing an extensible framework for navigating class imbalances across diverse real world datasets.Availability and implementationThe software package, and installation instructions are published on github.com/[email protected]


Symmetry ◽  
2020 ◽  
Vol 12 (6) ◽  
pp. 1022 ◽  
Author(s):  
Binh Thai Pham ◽  
Abolfazl Jaafari ◽  
Mohammadtaghi Avand ◽  
Nadhir Al-Ansari ◽  
Tran Dinh Du ◽  
...  

Predicting and mapping fire susceptibility is a top research priority in fire-prone forests worldwide. This study evaluates the abilities of the Bayes Network (BN), Naïve Bayes (NB), Decision Tree (DT), and Multivariate Logistic Regression (MLP) machine learning methods for the prediction and mapping fire susceptibility across the Pu Mat National Park, Nghe An Province, Vietnam. The modeling methodology was formulated based on processing the information from the 57 historical fires and a set of nine spatially explicit explanatory variables, namely elevation, slope degree, aspect, average annual temperate, drought index, river density, land cover, and distance from roads and residential areas. Using the area under the receiver operating characteristic curve (AUC) and seven other performance metrics, the models were validated in terms of their abilities to elucidate the general fire behaviors in the Pu Mat National Park and to predict future fires. Despite a few differences between the AUC values, the BN model with an AUC value of 0.96 was dominant over the other models in predicting future fires. The second best was the DT model (AUC = 0.94), followed by the NB (AUC = 0.939), and MLR (AUC = 0.937) models. Our robust analysis demonstrated that these models are sufficiently robust in response to the training and validation datasets change. Further, the results revealed that moderate to high levels of fire susceptibilities are associated with ~19% of the Pu Mat National Park where human activities are numerous. This study and the resultant susceptibility maps provide a basis for developing more efficient fire-fighting strategies and reorganizing policies in favor of sustainable management of forest resources.


Geosciences ◽  
2019 ◽  
Vol 9 (12) ◽  
pp. 504
Author(s):  
Josephine Morgenroth ◽  
Usman T. Khan ◽  
Matthew A. Perras

Machine learning methods for data processing are gaining momentum in many geoscience industries. This includes the mining industry, where machine learning is primarily being applied to autonomously driven vehicles such as haul trucks, and ore body and resource delineation. However, the development of machine learning applications in rock engineering literature is relatively recent, despite being widely used and generally accepted for decades in other risk assessment-type design areas, such as flood forecasting. Operating mines and underground infrastructure projects collect more instrumentation data than ever before, however, only a small fraction of the useful information is typically extracted for rock engineering design, and there is often insufficient time to investigate complex rock mass phenomena in detail. This paper presents a summary of current practice in rock engineering design, as well as a review of literature and methods at the intersection of machine learning and rock engineering. It identifies gaps, such as standards for architecture, input selection and performance metrics, and areas for future work. These gaps present an opportunity to define a framework for integrating machine learning into conventional rock engineering design methodologies to make them more rigorous and reliable in predicting probable underlying physical mechanics and phenomenon.


Complexity ◽  
2020 ◽  
Vol 2020 ◽  
pp. 1-16
Author(s):  
Afan Hasan ◽  
Oya Kalıpsız ◽  
Selim Akyokuş

Although the vast majority of fundamental analysts believe that technical analysts’ estimates and technical indicators used in these analyses are unresponsive, recent research has revealed that both professionals and individual traders are using technical indicators. A correct estimate of the direction of the financial market is a very challenging activity, primarily due to the nonlinear nature of the financial time series. Deep learning and machine learning methods on the other hand have achieved very successful results in many different areas where human beings are challenged. In this study, technical indicators were integrated into the methods of deep learning and machine learning, and the behavior of the traders was modeled in order to increase the accuracy of forecasting of the financial market direction. A set of technical indicators has been examined based on their application in technical analysis as input features to predict the oncoming (one-period-ahead) direction of Istanbul Stock Exchange (BIST100) national index. To predict the direction of the index, Deep Neural Network (DNN), Support Vector Machine (SVM), Random Forest (RF), and Logistic Regression (LR) classification techniques are used. The performance of these models is evaluated on the basis of various performance metrics such as confusion matrix, compound return, and max drawdown.


2020 ◽  
Vol 104 (24) ◽  
pp. 10515-10529
Author(s):  
Sanni Voutilainen ◽  
Markus Heinonen ◽  
Martina Andberg ◽  
Emmi Jokinen ◽  
Hannu Maaheimo ◽  
...  

Abstract In this work, deoxyribose-5-phosphate aldolase (Ec DERA, EC 4.1.2.4) from Escherichia coli was chosen as the protein engineering target for improving the substrate preference towards smaller, non-phosphorylated aldehyde donor substrates, in particular towards acetaldehyde. The initial broad set of mutations was directed to 24 amino acid positions in the active site or in the close vicinity, based on the 3D complex structure of the E. coli DERA wild-type aldolase. The specific activity of the DERA variants containing one to three amino acid mutations was characterised using three different substrates. A novel machine learning (ML) model utilising Gaussian processes and feature learning was applied for the 3rd mutagenesis round to predict new beneficial mutant combinations. This led to the most clear-cut (two- to threefold) improvement in acetaldehyde (C2) addition capability with the concomitant abolishment of the activity towards the natural donor molecule glyceraldehyde-3-phosphate (C3P) as well as the non-phosphorylated equivalent (C3). The Ec DERA variants were also tested on aldol reaction utilising formaldehyde (C1) as the donor. Ec DERA wild-type was shown to be able to carry out this reaction, and furthermore, some of the improved variants on acetaldehyde addition reaction turned out to have also improved activity on formaldehyde. Key points • DERA aldolases are promiscuous enzymes. • Synthetic utility of DERA aldolase was improved by protein engineering approaches. • Machine learning methods aid the protein engineering of DERA.


Author(s):  
Jing Xu ◽  
Fuyi Li ◽  
André Leier ◽  
Dongxu Xiang ◽  
Hsin-Hui Shen ◽  
...  

Abstract Antimicrobial peptides (AMPs) are a unique and diverse group of molecules that play a crucial role in a myriad of biological processes and cellular functions. AMP-related studies have become increasingly popular in recent years due to antimicrobial resistance, which is becoming an emerging global concern. Systematic experimental identification of AMPs faces many difficulties due to the limitations of current methods. Given its significance, more than 30 computational methods have been developed for accurate prediction of AMPs. These approaches show high diversity in their data set size, data quality, core algorithms, feature extraction, feature selection techniques and evaluation strategies. Here, we provide a comprehensive survey on a variety of current approaches for AMP identification and point at the differences between these methods. In addition, we evaluate the predictive performance of the surveyed tools based on an independent test data set containing 1536 AMPs and 1536 non-AMPs. Furthermore, we construct six validation data sets based on six different common AMP databases and compare different computational methods based on these data sets. The results indicate that amPEPpy achieves the best predictive performance and outperforms the other compared methods. As the predictive performances are affected by the different data sets used by different methods, we additionally perform the 5-fold cross-validation test to benchmark different traditional machine learning methods on the same data set. These cross-validation results indicate that random forest, support vector machine and eXtreme Gradient Boosting achieve comparatively better performances than other machine learning methods and are often the algorithms of choice of multiple AMP prediction tools.


Author(s):  
Nisha Joseph, Et. al.

The principal intention of this work is to compare the performance of the supervised brain tumour segmentation methods. These segmentation methods are based on machine learning. First, the input MR brain image is denoised by employing the adaptive bilateral filter, and the image contrast is enhanced employing the histogram equalization. Then we retrieve the features from the pre-processed image. Among several feature extraction methods, this work uses the shape, intensity, and texture feature extractors. Subsequent to removing these three types of features, fragment the tumor dependent on these recovered segments. The supervised segmentation approach is used for this. Among several supervised segmentation methods, this work uses three machine learning methods, namely Probabilistic Neural Network (PNN), Artificial Neural Network (ANN), and Convolution Neural Network (CNN). Finally, the retrieved features are feed into these machine learning methods to segment the brain tumour regions. To find out the best machine learning approach, the performance of these three supervised machines learning methods is evaluated by four performance metrics. Based on these evaluations, the best segmentation approach is discovered. Four execution boundaries are utilized, in particular, Dice Similarity Coefficient (DSC), Positive Predictive Value (PPV), Jaccard list (JI), and Sensitivity (SEN) to analyze the presentation of the AI strategy. The experimental outputs exposed that the CNN makes greater than other methods.


PLoS ONE ◽  
2021 ◽  
Vol 16 (4) ◽  
pp. e0249196
Author(s):  
Luke Sheneman ◽  
Gregory Stephanopoulos ◽  
Andreas E. Vasdekis

We report the application of supervised machine learning to the automated classification of lipid droplets in label-free, quantitative-phase images. By comparing various machine learning methods commonly used in biomedical imaging and remote sensing, we found convolutional neural networks to outperform others, both quantitatively and qualitatively. We describe our imaging approach, all implemented machine learning methods, and their performance with respect to computational efficiency, required training resources, and relative method performance measured across multiple metrics. Overall, our results indicate that quantitative-phase imaging coupled to machine learning enables accurate lipid droplet classification in single living cells. As such, the present paradigm presents an excellent alternative of the more common fluorescent and Raman imaging modalities by enabling label-free, ultra-low phototoxicity, and deeper insight into the thermodynamics of metabolism of single cells.


Sign in / Sign up

Export Citation Format

Share Document