Comprehensive assessment of machine learning-based methods for predicting antimicrobial peptides

Author(s):  
Jing Xu ◽  
Fuyi Li ◽  
André Leier ◽  
Dongxu Xiang ◽  
Hsin-Hui Shen ◽  
...  

Abstract Antimicrobial peptides (AMPs) are a unique and diverse group of molecules that play a crucial role in a myriad of biological processes and cellular functions. AMP-related studies have become increasingly popular in recent years due to antimicrobial resistance, an emerging global concern. Systematic experimental identification of AMPs faces many difficulties due to the limitations of current methods. Given the significance of this task, more than 30 computational methods have been developed for the accurate prediction of AMPs. These approaches differ widely in data set size, data quality, core algorithms, feature extraction, feature selection techniques and evaluation strategies. Here, we provide a comprehensive survey of current approaches for AMP identification and point out the differences between these methods. In addition, we evaluate the predictive performance of the surveyed tools on an independent test data set containing 1536 AMPs and 1536 non-AMPs. Furthermore, we construct six validation data sets based on six different common AMP databases and compare the computational methods on these data sets. The results indicate that amPEPpy achieves the best predictive performance, outperforming the other compared methods. As predictive performance is affected by the different data sets used by different methods, we additionally perform a 5-fold cross-validation test to benchmark the traditional machine learning methods on the same data set. These cross-validation results indicate that random forest, support vector machine and eXtreme Gradient Boosting achieve comparatively better performance than other machine learning methods and are often the algorithms of choice of multiple AMP prediction tools.
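
The benchmarking setup described above can be sketched with scikit-learn: a 5-fold cross-validation comparison of two of the algorithms named in the abstract, run on the same data set so the comparison is fair. The synthetic feature matrix below is purely an illustrative stand-in for curated peptide features (the real study also benchmarks eXtreme Gradient Boosting and other learners).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for an AMP feature matrix: 1536 AMPs vs 1536 non-AMPs.
X, y = make_classification(n_samples=3072, n_features=50,
                           n_informative=20, random_state=0)

models = {
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM (RBF)": SVC(kernel="rbf"),
}

results = {}
for name, model in models.items():
    # 5-fold CV accuracy on the same data set for each method.
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    results[name] = scores.mean()
```

Any of the surveyed tools could be slotted into the `models` dictionary in the same way, as long as it exposes the scikit-learn estimator interface.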

2019 ◽  
pp. 089443931988844
Author(s):  
Ranjith Vijayakumar ◽  
Mike W.-L. Cheung

Machine learning methods have become very popular in diverse fields due to their focus on predictive accuracy, but little work has been conducted on how to assess the replicability of their findings. We introduce and adapt replication methods advocated in psychology to the aims and procedural needs of machine learning research. In Study 1, we illustrate these methods with an empirical data set, assessing the replication success of a predictive accuracy measure, namely R², on the cross-validated and test sets of the samples. We introduce three replication aims. First, tests of inconsistency examine whether single replications have successfully rejected the original study. Rejection is supported if the 95% confidence interval (CI) of the R² difference estimate between replication and original does not contain zero. Second, tests of consistency help support claims of successful replication. We can decide a priori on a region of equivalence, where population values of the difference estimates are considered equivalent for substantive reasons. The 90% CI of a difference estimate lying fully within this region supports replication. Third, we show how to combine replications to construct meta-analytic intervals for better precision of predictive accuracy measures. In Study 2, R² is reduced from the original in a subset of replication studies to examine the ability of the replication procedures to distinguish true replications from nonreplications. We find that when combining studies sampled from the same population to form meta-analytic intervals, random-effects methods perform best for cross-validated measures, while fixed-effects methods work best for test measures. Among machine learning methods, regression was comparable to many complex methods, while support vector machine performed most reliably across a variety of scenarios. Social scientists who use machine learning to model empirical data can use these methods to enhance the reliability of their findings.
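
A minimal sketch of the inconsistency and consistency tests described above, assuming a simple percentile bootstrap for the CI of the R² difference between a replication and the original study (the authors' exact interval construction may differ; all data here are synthetic, with both samples drawn from the same population).

```python
import numpy as np

rng = np.random.default_rng(0)

def r2(y_true, y_pred):
    # Coefficient of determination.
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def diff_ci(y_o, p_o, y_r, p_r, level=0.95, n_boot=2000):
    # Percentile bootstrap CI for the R^2 difference (replication - original).
    diffs = []
    for _ in range(n_boot):
        i = rng.integers(0, len(y_o), len(y_o))
        j = rng.integers(0, len(y_r), len(y_r))
        diffs.append(r2(y_r[j], p_r[j]) - r2(y_o[i], p_o[i]))
    lo, hi = np.percentile(diffs, [(1 - level) / 2 * 100,
                                   (1 + level) / 2 * 100])
    return lo, hi

# Toy original and replication samples from the same population.
x = rng.normal(size=500); y_o = 2 * x + rng.normal(size=500); p_o = 2 * x
x2 = rng.normal(size=500); y_r = 2 * x2 + rng.normal(size=500); p_r = 2 * x2

# Inconsistency test: does the 95% CI of the difference contain zero?
lo95, hi95 = diff_ci(y_o, p_o, y_r, p_r, level=0.95)
# Consistency test: does the 90% CI lie fully inside an a priori
# region of equivalence (e.g. [-0.05, 0.05])?
lo90, hi90 = diff_ci(y_o, p_o, y_r, p_r, level=0.90)
```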


2021 ◽  
Author(s):  
Qifei Zhao ◽  
Xiaojun Li ◽  
Yunning Cao ◽  
Zhikun Li ◽  
Jixin Fan

Abstract Collapsibility of loess is a significant factor affecting engineering construction in loess areas, and testing the collapsibility of loess is costly. In this study, a total of 4,256 loess samples were collected from the north, east, west and middle regions of Xining. 70% of the samples are used to generate the training data set, and the rest are used to generate the verification data set, so as to construct and validate the machine learning models. The six most important factors are selected from thirteen factors using grey relational analysis and multicollinearity analysis: burial depth, water content, specific gravity of soil particles, void ratio, geostatic stress and plasticity limit. In order to predict the collapsibility of loess, four machine learning methods, Support Vector Machine (SVM), Random Subspace Based Support Vector Machine (RSSVM), Random Forest (RF) and Naïve Bayes Tree (NBTree), are studied and compared. Receiver operating characteristic (ROC) curve indicators, standard deviation (SD) and 95% confidence interval (CI) are used to verify and compare the models in the different research areas. The results show that the RF model is the most efficient in predicting the collapsibility of loess in Xining, with an average AUC above 80%, and it can be used in engineering practice.
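
The model-comparison indicators named above (AUC with a dispersion estimate and a 95% CI) can be sketched as follows, assuming a bootstrap over the verification samples; the scores below are synthetic stand-ins for a classifier's outputs on collapsible (1) vs. non-collapsible (0) loess samples.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Toy verification set: labels and informative-but-noisy model scores.
y = rng.integers(0, 2, size=300)
scores = y * 0.8 + rng.normal(0.0, 0.6, size=300)

auc = roc_auc_score(y, scores)

# Bootstrap the AUC to get a dispersion estimate and a 95% CI.
boot = []
for _ in range(1000):
    i = rng.integers(0, len(y), len(y))
    if len(np.unique(y[i])) == 2:   # resample must contain both classes
        boot.append(roc_auc_score(y[i], scores[i]))
sd = np.std(boot)
lo, hi = np.percentile(boot, [2.5, 97.5])
```

Running the same procedure on each of the four classifiers' scores, region by region, yields directly comparable AUC intervals.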


2020 ◽  
pp. 5-18
Author(s):  
N. N. Kiselyova ◽  
V. A. Dudarev ◽  
V. V. Ryazanov ◽  
O. V. Sen’ko ◽  
...  

New chalcospinels of the most common compositions were predicted: A(I)B(III)C(IV)X4 (X is S or Se) and A(II)B(III)C(III)S4 (where A, B, and C are various chemical elements). They are promising in the search for new materials for magneto-optical memory elements, sensors and anodes in sodium-ion batteries. The values of their crystal lattice parameter "a" were estimated. Only the values of chemical element properties were used in the predictions. The calculations were carried out using machine learning programs that are part of the information-analytical system developed by the authors: various ensembles of algorithms (binary decision trees, the linear machine, the search for logical regularities of classes, the support vector machine, Fisher's linear discriminant, k-nearest neighbors, and multilayer perceptron and neural network learning) for predicting chalcospinels not yet obtained, as well as an extensive family of regression methods from the scikit-learn package for Python and multilevel machine learning methods proposed by the authors for estimating the lattice parameter value of the new chalcospinels. The prediction accuracy for new chalcospinels according to the cross-validation results is not lower than 80%, and the prediction accuracy for their crystal lattice parameter, measured by the mean absolute error under leave-one-out cross-validation, is ±0.1 Å. The effectiveness of using multilevel machine learning methods to predict the physical properties of substances was shown.


Author(s):  
Wolfgang Drobetz ◽  
Tizian Otto

Abstract This paper evaluates the predictive performance of machine learning methods in forecasting European stock returns. Compared to a linear benchmark model, interactions and nonlinear effects help improve predictive performance, but machine learning models must be adequately trained and tuned to overcome the high-dimensionality problem and to avoid overfitting. Across all machine learning methods, the most important predictors are based on price trends and fundamental signals from valuation ratios. However, the models exhibit substantial variation in statistical predictive performance that translates into pronounced differences in economic profitability. The return and risk measures of long-only trading strategies indicate that machine learning models produce sizeable gains relative to our benchmark. Neural networks perform best, also after accounting for transaction costs. A classification-based portfolio formation, utilizing a support vector machine that avoids estimating stock-level expected returns, performs even better than the neural network architecture.


2021 ◽  
Vol 4 ◽  
Author(s):  
Danielle Barnes ◽  
Luis Polanco ◽  
Jose A. Perea

Many and varied methods currently exist for featurization, which is the process of mapping persistence diagrams to Euclidean space, with the goal of maximally preserving structure. However, and to our knowledge, there are presently no methodical comparisons of existing approaches, nor a standardized collection of test data sets. This paper provides a comparative study of several such methods. In particular, we review, evaluate, and compare the stable multi-scale kernel, persistence landscapes, persistence images, the ring of algebraic functions, template functions, and adaptive template systems. Using these approaches for feature extraction, we apply and compare popular machine learning methods on five data sets: MNIST, Shape retrieval of non-rigid 3D Human Models (SHREC14), extracts from the Protein Classification Benchmark Collection (Protein), MPEG7 shape matching, and HAM10000 skin lesion data set. These data sets are commonly used in the above methods for featurization, and we use them to evaluate predictive utility in real-world applications.


2014 ◽  
Vol 2014 ◽  
pp. 1-7 ◽  
Author(s):  
Jingyi Mu ◽  
Fang Wu ◽  
Aihua Zhang

In the era of big data, many urgent issues in all walks of life can be addressed with big data techniques. Compared with the Internet, economy, industry, and aerospace fields, applications of big data in architecture are relatively scarce. In this paper, on the basis of actual data, the values of Boston suburb houses are forecast by several machine learning methods. Based on the predictions, the government and developers can decide whether or not to develop real estate in the corresponding regions. In this paper, support vector machine (SVM), least squares support vector machine (LSSVM), and partial least squares (PLS) methods are used to forecast the home values, and these algorithms are compared according to the predicted results. Experiments show that although the data set exhibits serious nonlinearity, the SVM and LSSVM methods are superior to PLS in dealing with it. SVM can find the globally optimal solution and achieve the best forecasting performance because its training reduces to solving a quadratic programming problem. In this paper, the computational efficiencies of the algorithms are also compared according to their computing times.


2018 ◽  
Vol 18 (12) ◽  
pp. 987-997 ◽  
Author(s):  
Li Zhang ◽  
Hui Zhang ◽  
Haixin Ai ◽  
Huan Hu ◽  
Shimeng Li ◽  
...  

Toxicity evaluation is an important part of the preclinical safety assessment of new drugs, directly related to human health and the fate of drug candidates, so it is important to study how to evaluate drug toxicity accurately and economically. Traditional in vitro and in vivo toxicity tests are laborious, time-consuming and highly expensive, and even involve animal welfare issues. Computational methods developed for drug toxicity prediction can compensate for the shortcomings of traditional methods and have been considered useful in the early stages of drug development. Numerous drug toxicity prediction models have been developed using a variety of computational methods. With advances in machine learning theory and molecular representation, more and more drug toxicity prediction models are being built with machine learning methods such as support vector machine, random forest, naïve Bayes and back-propagation neural networks, and significant advances have been made for many toxicity endpoints, such as carcinogenicity, mutagenicity, and hepatotoxicity. In this review, we aim to provide a comprehensive overview of the machine learning-based drug toxicity prediction studies conducted in recent years. In addition, we compare the performance of the models proposed in these studies in terms of accuracy, sensitivity, and specificity, providing a view of the current state of the art in this field and highlighting the issues in the current studies.


2021 ◽  
Vol 13 (20) ◽  
pp. 4149
Author(s):  
Soo-In Sohn ◽  
Young-Ju Oh ◽  
Subramani Pandian ◽  
Yong-Ho Lee ◽  
John-Lewis Zinia Zaukuu ◽  
...  

The feasibility of rapid and non-destructive classification of six different Amaranthus species was investigated using visible-near-infrared (Vis-NIR) spectra coupled with chemometric approaches. The focus of this research was to use a handheld spectrometer in the field to classify six Amaranthus species in different geographical regions of South Korea. Spectra were obtained from the adaxial side of the leaves at 1.5 nm intervals in the Vis-NIR spectral range between 400 and 1075 nm. The obtained spectra were assessed with four different preprocessing methods in order to identify the optimal preprocessing method with high classification accuracy. Preprocessed spectra of the six Amaranthus species were used as input for the machine learning-based chemometric analysis. All classification results were validated using cross-validation to produce robust estimates of classification accuracy. The different combinations of preprocessing and modeling yielded classification accuracies between 71% and 99.7% after cross-validation. The combination of Savitzky-Golay preprocessing and support vector machine showed a maximum mean classification accuracy of 99.7% for the discrimination of the Amaranthus species. Given the high number of spectra involved in this study, the growth stage of the plants, the varying measurement locations, and the scanning position of leaves on the plant are all important considerations. We conclude that Vis-NIR spectroscopy, in combination with appropriate preprocessing and machine learning methods, may be used in the field to effectively classify Amaranthus species for the management of these weedy species and/or for monitoring their food applications.
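
The Savitzky-Golay preprocessing step named above can be sketched with SciPy. The spectra below are synthetic (a smooth curve plus noise) on the study's actual grid of 400-1075 nm at 1.5 nm intervals; smoothing and a first derivative are two common Vis-NIR preprocessing variants, and the choice of window length and polynomial order here is illustrative.

```python
import numpy as np
from scipy.signal import savgol_filter

# Synthetic reflectance spectra on the study's wavelength grid:
# 400-1075 nm sampled every 1.5 nm (450 points per spectrum).
wavelengths = np.arange(400.0, 1075.0, 1.5)
rng = np.random.default_rng(1)
spectra = (np.sin(wavelengths / 100.0)
           + 0.05 * rng.normal(size=(10, wavelengths.size)))

# Savitzky-Golay smoothing: fit a local quadratic in an 11-point window.
smoothed = savgol_filter(spectra, window_length=11, polyorder=2, axis=1)

# First-derivative variant, often used to remove baseline offsets.
first_deriv = savgol_filter(spectra, window_length=11, polyorder=2,
                            deriv=1, axis=1)
```

The preprocessed matrix (`smoothed` or `first_deriv`) would then feed the SVM or other chemometric classifier.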


2020 ◽  
Vol 98 (6) ◽  
Author(s):  
Anderson Antonio Carvalho Alves ◽  
Rebeka Magalhães da Costa ◽  
Tiago Bresolin ◽  
Gerardo Alves Fernandes Júnior ◽  
Rafael Espigolan ◽  
...  

Abstract The aim of this study was to compare the predictive performance of the Genomic Best Linear Unbiased Predictor (GBLUP) and machine learning methods (Random Forest, RF; Support Vector Machine, SVM; Artificial Neural Network, ANN) in simulated populations presenting different levels of dominance effects. The simulated genome comprised 50k SNPs and 300 QTL, both biallelic and randomly distributed across 29 autosomes. A total of six traits were simulated considering different values for the narrow- and broad-sense heritability. In the purely additive scenario with low heritability (h2 = 0.10), the predictive ability obtained using GBLUP was slightly higher than that of the other methods, whereas ANN provided the highest accuracies for scenarios with moderate heritability (h2 = 0.30). The accuracies of dominance deviation predictions varied from 0.180 to 0.350 in GBLUP extended for dominance effects (GBLUP-D) and from 0.06 to 0.185 in RF, and they were null using the ANN and SVM methods. Although RF presented higher accuracies for total genetic effect predictions, the mean-squared error values in such a model were worse than those observed for GBLUP-D in scenarios with large additive and dominance variances. When applied to prescreen important regions, the RF approach detected QTL with high additive and/or dominance effects. Among the machine learning methods, only RF was able to implicitly capture dominance effects without increasing the number of covariates in the model, resulting in higher accuracies for the total genetic and phenotypic values as the dominance ratio increases. Nevertheless, when the interest is to infer dominance effects directly, GBLUP-D may be the more suitable method.
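
A minimal sketch of the additive GBLUP baseline compared above, assuming variance components are known and using VanRaden's genomic relationship matrix on simulated biallelic genotypes. The marker counts, effect sizes, and heritability below are illustrative, much smaller than the study's 50k SNP genome.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 1000                                   # animals, SNP markers

# Biallelic genotypes coded 0/1/2, then centered by twice the allele frequency.
M = rng.integers(0, 3, size=(n, m)).astype(float)
p = M.mean(axis=0) / 2.0
Z = M - 2.0 * p

# VanRaden genomic relationship matrix.
G = Z @ Z.T / (2.0 * np.sum(p * (1.0 - p)))

# Simulated additive breeding values and phenotypes (h2 = 0.5).
u_true = Z @ rng.normal(0.0, 0.05, size=m)
y = u_true + rng.normal(0.0, u_true.std(), size=n)

# GBLUP: u_hat = G (G + lambda*I)^{-1} (y - mean), lambda = sigma_e^2/sigma_u^2.
lam = 1.0                                          # assumed known here
u_hat = G @ np.linalg.solve(G + lam * np.eye(n), y - y.mean())

# Predictive ability: correlation of predicted and true breeding values.
acc = np.corrcoef(u_hat, u_true)[0, 1]
```

Extending this to GBLUP-D would add a dominance relationship matrix and a second random effect, which is the model the study uses for dominance deviations.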


Author(s):  
Yafei Wu ◽  
Ya Fang

Timely stroke diagnosis and intervention are necessary considering its high prevalence. Previous studies have mainly focused on stroke prediction with balanced data. This study therefore aimed to develop machine learning models for predicting stroke with imbalanced data in an elderly population in China. Data were obtained from a prospective cohort that included 1131 participants (56 stroke patients and 1075 non-stroke participants) in 2012 and 2014, respectively. Data balancing techniques, including random over-sampling (ROS), random under-sampling (RUS), and the synthetic minority over-sampling technique (SMOTE), were used to process the imbalanced data. Machine learning methods, namely regularized logistic regression (RLR), support vector machine (SVM), and random forest (RF), were used to predict stroke from demographic, lifestyle, and clinical variables. Accuracy, sensitivity, specificity, and areas under the receiver operating characteristic curves (AUCs) were used for performance comparison. The top five variables for stroke prediction were selected for each machine learning method based on the SMOTE-balanced data set. The total prevalence of stroke was high in 2014 (4.95%), with men experiencing much higher prevalence than women (6.76% vs. 3.25%). The three machine learning methods performed poorly on the imbalanced data set, with extremely low sensitivity (approximately 0.00) and AUC (approximately 0.50). After applying the data balancing techniques, sensitivity and AUC improved considerably with moderate accuracy and specificity, and the maximum values for sensitivity and AUC reached 0.78 (95% CI, 0.73–0.83) for RF and 0.72 (95% CI, 0.71–0.73) for RLR, respectively. Using the AUCs for RLR, SVM, and RF on the imbalanced data set as references, a significant improvement was observed in the AUCs of all three machine learning methods (p < 0.05) on the balanced data sets. With RLR in each data set as a reference, only RF on the imbalanced data set and SVM on the ROS-balanced data set were superior to RLR in terms of AUC. Sex, hypertension, and uric acid were common predictors across all three machine learning methods, and blood glucose level was included in both RLR and RF; drinking, age together with high-sensitivity C-reactive protein level, and low-density lipoprotein cholesterol level were additionally included in RLR, SVM, and RF, respectively. Our study suggests that machine learning methods combined with data balancing techniques are effective tools for stroke prediction with imbalanced data.
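
The ROS balancing step described above can be sketched in a few lines of NumPy: duplicate minority-class rows with replacement until the classes are equal in size. This is a simple stand-in for the study's balancing pipeline (SMOTE instead synthesizes new minority samples by interpolating between nearest neighbors); the feature matrix is synthetic, but the 56-vs-1075 imbalance mirrors the cohort.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Random over-sampling: duplicate minority rows until classes balance."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    idx = []
    for c, n in zip(classes, counts):
        members = np.flatnonzero(y == c)
        idx.append(members)                       # keep all original rows
        if n < n_max:                             # top up minority class
            idx.append(rng.choice(members, size=n_max - n, replace=True))
    idx = np.concatenate(idx)
    return X[idx], y[idx]

# Imbalance matching the cohort: 56 stroke vs 1075 non-stroke participants.
rng = np.random.default_rng(0)
X = rng.normal(size=(1131, 5))
y = np.array([1] * 56 + [0] * 1075)

Xb, yb = random_oversample(X, y)
```

The balanced `(Xb, yb)` would then be fed to RLR, SVM, or RF exactly as the original data were; over-sampling must be applied only to the training split to avoid leakage into the evaluation set.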

