scholarly journals Effect of Dataset Size and Train/Test Split Ratios in QSAR/QSPR Multiclass Classification

Molecules ◽  
2021 ◽  
Vol 26 (4) ◽  
pp. 1111 ◽  
Author(s):  
Anita Rácz ◽  
Dávid Bajusz ◽  
Károly Héberger

Applied datasets can vary from a few hundred to thousands of samples in typical quantitative structure-activity/property (QSAR/QSPR) relationships and classification. However, the size of the datasets and the train/test split ratios can greatly affect the outcome of the models, and thus the classification performance itself. We compared several combinations of dataset sizes and split ratios with five different machine learning algorithms to find the differences or similarities and to select the best parameter settings in nonbinary (multiclass) classification. It is also known that the models are ranked differently according to the performance merit(s) used. Here, 25 performance parameters were calculated for each model, then factorial ANOVA was applied to compare the results. The results clearly show the differences not just between the applied machine learning algorithms but also between the dataset sizes and to a lesser extent the train/test split ratios. The XGBoost algorithm could outperform the others, even in multiclass modeling. The performance parameters reacted differently to the change of the sample set size; some of them were much more sensitive to this factor than the others. Moreover, significant differences could be detected between train/test split ratios as well, exerting a great effect on the test validation of our models.

Materials ◽  
2021 ◽  
Vol 14 (5) ◽  
pp. 1089
Author(s):  
Sung-Hee Kim ◽  
Chanyoung Jeong

This study aims to demonstrate the feasibility of applying eight machine learning algorithms to predict the classification of the surface characteristics of titanium oxide (TiO2) nanostructures with different anodization processes. We produced a total of 100 samples, and we assessed changes in TiO2 nanostructures’ thicknesses by performing anodization. We successfully grew TiO2 films with different thicknesses by one-step anodization in ethylene glycol containing NH4F and H2O at applied voltage differences ranging from 10 V to 100 V at various anodization durations. We found that the thicknesses of TiO2 nanostructures are dependent on anodization voltages under time differences. Therefore, we tested the feasibility of applying machine learning algorithms to predict the deformation of TiO2. As the characteristics of TiO2 changed based on the different experimental conditions, we classified its surface pore structure into two categories and four groups. For the classification based on granularity, we assessed layer creation, roughness, pore creation, and pore height. We applied eight machine learning techniques to predict classification for binary and multiclass classification. For binary classification, random forest and gradient boosting algorithm had relatively high performance. However, all eight algorithms had scores higher than 0.93, which signifies high prediction on estimating the presence of pore. In contrast, decision tree and three ensemble methods had a relatively higher performance for multiclass classification, with an accuracy rate greater than 0.79. The weakest algorithm used was k-nearest neighbors for both binary and multiclass classifications. We believe that these results show that we can apply machine learning techniques to predict surface quality improvement, leading to smart manufacturing technology to better control color appearance, super-hydrophobicity, super-hydrophilicity or batter efficiency.


World Health Organization’s (WHO) report 2018, on diabetes has reported that the number of diabetic cases has increased from one hundred eight million to four hundred twenty-two million from the year 1980. The fact sheet shows that there is a major increase in diabetic cases from 4.7% to 8.5% among adults (18 years of age). Major health hazards caused due to diabetes include kidney function failure, heart disease, blindness, stroke, and lower limb dismembering. This article applies supervised machine learning algorithms on the Pima Indian Diabetic dataset to explore various patterns of risks involved using predictive models. Predictive model construction is based upon supervised machine learning algorithms: Naïve Bayes, Decision Tree, Random Forest, Gradient Boosted Tree, and Tree Ensemble. Further, the analytical patterns about these predictive models have been presented based on various performance parameters which include accuracy, precision, recall, and F-measure.


2020 ◽  
Vol 10 (15) ◽  
pp. 5075 ◽  
Author(s):  
Peng Fang ◽  
Xiwang Zhang ◽  
Panpan Wei ◽  
Yuanzheng Wang ◽  
Huiyi Zhang ◽  
...  

Machine learning algorithms are crucial for crop identification and mapping. However, many works only focus on the identification results of these algorithms, but pay less attention to their classification performance and mechanism. In this paper, based on Google Earth Engine (GEE), Sentinel-2 10 m resolution images during a specific phenological period of winter wheat were obtained. Then, support vector machine (SVM), random forest (RF), and classification and regression tree (CART) machine learning algorithms were employed to identify and map winter wheat in a large-scale area. The hyperparameters of the three machine learning algorithms were tuned by grid search and the 5-fold cross-validation method. The classification performance of the three machine learning algorithms were compared, the results of which demonstrate that SVM achieves best performance in identifying winter wheat, and its overall accuracy (OA), user’s accuracy (UA), producer’s accuracy (PA), and kappa coefficient (Kappa) are 0.94, 0.95, 0.95, and 0.92, respectively. Moreover, 50 various combinations of training and validation sets were used to analyze the generalization ability of the algorithms, and the results show that the average OA of SVM, RF, and CART are 0.93, 0.92, and 0.88, respectively, thus indicating that SVM and RF are more robust than CART. To further explore the sensitivity of SVM, RF, and CART to variations of the algorithm parameters—namely, (C and gamma), (tree and split), and (maxD and minSP)—we employed the grid search method to iterate these parameters, respectively, and to analyze the effect of these parameters on the accuracy scores and classification residuals. It was found that with the change of (C and gamma) in (0.01~1000), SVM’s maximum variation of accuracy score is up to 0.63, and the maximum variation of residuals is 76,215 km2. We concluded that SVM is sensitive to the parameters (C and gamma) and presents a positive correlation. When the parameters (tree and split) change between (100~600) and (1~6), respectively, the RF’s maximum variation of accuracy score is 0.08, and the maximum variation of residuals is 1157 km2, indicating that RF is low in sensitivity toward the parameters (tree and split). When the parameters (maxD and minSP) are between (10~60), the maximum accuracy change value is 0.06, and the maximum variation of residuals is 6943 km2. Therefore, compared to RF, CART is sensitive to the parameters (maxD and minSP) and has poor robustness. In general, under the conditions of the hyperparameters, SVM and RF exhibit optimal classification performance, while CART has relatively inferior performance. Meanwhile, SVM, RF, and CART have different sensitivities toward the algorithm parameters; that is, SVM and CART are more sensitive to the algorithm parameters, while RF has low sensitivity toward changes in the algorithm parameters. The different parameters cause great changes in the accuracy scores and residuals, so it is necessary to determine the algorithm hyperparameters. Generally, default parameters can be used to achieve crop classification, but we recommend the enumeration method, similar to grid search, as a practical way to improve the classification performance of the algorithm if the best classification effect is expected.


Eos ◽  
2022 ◽  
Vol 103 ◽  
Author(s):  
JoAnna Wendel

Researchers applied machine learning algorithms to several distinct chemical compositions of Mars and suggest that these algorithms could be a powerful tool to map the planet’s surface on a large scale.


Author(s):  
Junggu Choi ◽  
Seoyoung Cho ◽  
Inhwan Ko ◽  
Sanghoon Han

Investigating suicide risk factors is critical for socioeconomic and public health, and many researchers have tried to identify factors associated with suicide. In this study, the risk factors for suicidal ideation were compared, and the contributions of different factors to suicidal ideation and attempt were investigated. To reflect the diverse characteristics of the population, the large-scale and longitudinal dataset used in this study included both socioeconomic and clinical variables collected from the Korean public. Three machine learning algorithms (XGBoost classifier, support vector classifier, and logistic regression) were used to detect the risk factors for both suicidal ideation and attempt. The importance of the variables was determined using the model with the best classification performance. In addition, a novel risk-factor score, calculated from the rank and importance scores of each variable, was proposed. Socioeconomic and sociodemographic factors showed a high correlation with risks for both ideation and attempt. Mental health variables ranked higher than other factors in suicidal attempts, posing a relatively higher suicide risk than ideation. These trends were further validated using the conditions from the integrated and yearly dataset. This study provides novel insights into suicidal risk factors for suicidal ideations and attempts.


Machine learning (ML) has become the most predominant methodology that shows good results in the classification and prediction domains. Predictive systems are being employed to predict events and its results in almost every walk of life. The field of prediction in sports is gaining importance as there is a huge community of betters and sports fans. Moreover team owners and club managers are struggling for Machine learning models that could be used for formulating strategies to win matches. Numerous factors such as results of previous matches, indicators of player performance and opponent information are required to build these models. This paper provides an analysis of such key models focusing on application of machine learning algorithms to sport result prediction. The results obtained helped us to elucidate the best combination of feature selection and classification algorithms that render maximum accuracy in sport result prediction.


Sign in / Sign up

Export Citation Format

Share Document