Training set formation in machine learning problems (review)

Author(s):  
Andrey Parasich ◽  
Victor Parasich ◽  
Irina Parasich

Introduction: Proper training set formation is a key factor in machine learning. In real training sets, problems and errors commonly occur and have a critical impact on the training result. A training set needs to be formed in every machine learning problem; therefore, knowledge of the possible difficulties is helpful. Purpose: To give an overview of possible problems in the formation of a training set, in order to facilitate their detection and elimination when working with real training sets, and to analyze the impact of these problems on training results. Results: The article gives an overview of possible errors in training set formation, such as lack of data, imbalance, false patterns, sampling from a limited set of sources, change in the general population over time, and others. We discuss the influence of these errors on the training result, test set formation, and the measurement of training algorithm quality. Pseudo-labeling, data augmentation, and hard sample mining are considered the most effective ways to expand a training set. We offer practical recommendations for forming a training or test set, with examples from the practice of Kaggle competitions. For the problem of cross-dataset generalization in neural network training, we propose an algorithm called Cross-Dataset Machine, which is simple to implement and yields a gain in cross-dataset generalization. Practical relevance: The materials of the article can be used as a practical guide when solving machine learning problems.
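The pseudo-labeling idea mentioned above can be sketched in a few lines. This is a minimal illustration, not the review's own algorithm: the nearest-centroid classifier, the distance-margin confidence score, and the threshold are all illustrative assumptions.

```python
import math

def centroid(points):
    """Mean of a list of equal-length feature vectors."""
    n = len(points)
    return [sum(xs) / n for xs in zip(*points)]

def nearest_centroid_predict(x, centroids):
    """Return (label, confidence); confidence is the crude margin
    between the best and second-best centroid distances."""
    dists = {lbl: math.dist(x, c) for lbl, c in centroids.items()}
    label = min(dists, key=dists.get)
    ordered = sorted(dists.values())
    confidence = ordered[1] - ordered[0] if len(ordered) > 1 else 1.0
    return label, confidence

def pseudo_label(labeled, unlabeled, threshold=1.0, rounds=3):
    """Iteratively move confident predictions from the unlabeled pool
    into the labeled set -- a minimal pseudo-labeling loop."""
    labeled = {lbl: list(pts) for lbl, pts in labeled.items()}
    pool = list(unlabeled)
    for _ in range(rounds):
        cents = {lbl: centroid(pts) for lbl, pts in labeled.items()}
        still_unlabeled = []
        for x in pool:
            lbl, conf = nearest_centroid_predict(x, cents)
            if conf >= threshold:
                labeled[lbl].append(x)   # accept the pseudo-label
            else:
                still_unlabeled.append(x)
        pool = still_unlabeled
    return labeled, pool
```

Samples that never clear the confidence threshold (here, a point equidistant from both classes) simply remain unlabeled, which is the usual safeguard against amplifying wrong pseudo-labels.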

2019 ◽  
Vol 35 (20) ◽  
pp. 3989-3995 ◽  
Author(s):  
Hongjian Li ◽  
Jiangjun Peng ◽  
Pavel Sidorov ◽  
Yee Leung ◽  
Kwong-Sak Leung ◽  
...  

Abstract Motivation Studies have shown that the accuracy of random forest (RF)-based scoring functions (SFs), such as RF-Score-v3, increases with more training samples, whereas that of classical SFs, such as X-Score, does not. Nevertheless, the impact of the similarity between training and test samples on this matter has not been studied in a systematic manner. It is therefore unclear how these SFs would perform when only trained on protein-ligand complexes that are highly dissimilar or highly similar to the test set. It is also unclear whether SFs based on machine learning algorithms other than RF can also improve accuracy with increasing training set size and to what extent they learn from dissimilar or similar training complexes. Results We present a systematic study to investigate how the accuracy of classical and machine-learning SFs varies with protein-ligand complex similarities between training and test sets. We considered three types of similarity metrics, based on the comparison of either protein structures, protein sequences or ligand structures. Regardless of the similarity metric, we found that incorporating a larger proportion of similar complexes into the training set did not make classical SFs more accurate. In contrast, RF-Score-v3 was able to outperform X-Score even when trained on just 32% of the most dissimilar complexes, showing that its superior performance owes considerably to learning from training complexes dissimilar to those in the test set. In addition, we generated the first SF employing Extreme Gradient Boosting (XGBoost), XGB-Score, and observed that it also improves with training set size while outperforming the rest of the SFs. Given the continuous growth of training datasets, the development of machine-learning SFs has become very appealing. Availability and implementation https://github.com/HongjianLi/MLSF Supplementary information Supplementary data are available at Bioinformatics online.
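The "train on the most dissimilar complexes" experiment can be sketched as follows. The study used protein-structure, protein-sequence and ligand-structure metrics; this sketch assumes only ligand fingerprints represented as bit sets and Tanimoto similarity, which is a simplification of the paper's setup.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprint bit sets."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def most_dissimilar_subset(train, test, fraction=0.32):
    """Keep the given fraction of training items least similar to the
    test set, scoring each by its max Tanimoto to any test item."""
    scored = sorted(train, key=lambda fp: max(tanimoto(fp, t) for t in test))
    k = max(1, int(len(scored) * fraction))
    return scored[:k]
```

A scoring function trained on the surviving subset can then be compared against one trained on the full set to measure how much it learns from dissimilar complexes.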


2021 ◽  
Vol 11 (5) ◽  
pp. 2039
Author(s):  
Hyunseok Shin ◽  
Sejong Oh

In machine learning applications, classification schemes have been widely used for prediction tasks. Typically, to develop a prediction model, the given dataset is divided into training and test sets; the training set is used to build the model and the test set is used to evaluate it. Furthermore, random sampling is traditionally used to divide datasets. The problem, however, is that the performance of the model is evaluated differently depending on how we divide the training and test sets. Therefore, in this study, we propose an improved sampling method for the accurate evaluation of a classification model. We first generated numerous candidate train/test splits using the R-value-based sampling method. We evaluated how closely the distribution of each candidate matched that of the whole dataset, and the candidate with the smallest distribution difference was selected as the final train/test split. Histograms and feature importance were used to evaluate the similarity of distributions. The proposed method produces better-suited training and test sets than previous sampling methods, including random and non-random sampling.
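The candidate-selection idea can be sketched as below. The paper generates candidates with R-value-based sampling and also uses feature importance; this sketch substitutes plain random candidates and a single-feature histogram distance, so it illustrates only the "pick the split whose distribution best matches the whole dataset" step.

```python
import random
from collections import Counter

def histogram(values, bins=5, lo=0.0, hi=1.0):
    """Fixed-range histogram as bin -> proportion."""
    counts = Counter(min(int((v - lo) / (hi - lo) * bins), bins - 1)
                     for v in values)
    n = len(values)
    return {b: counts.get(b, 0) / n for b in range(bins)}

def hist_distance(a, b):
    """L1 distance between two histograms over the same bins."""
    return sum(abs(a[k] - b[k]) for k in a)

def best_split(data, test_ratio=0.3, candidates=50, seed=0):
    """Among random candidate splits, return the one whose train and
    test histograms are jointly closest to the whole dataset's."""
    rng = random.Random(seed)
    ref = histogram(data)
    n_test = int(len(data) * test_ratio)
    best = None
    for _ in range(candidates):
        shuffled = data[:]
        rng.shuffle(shuffled)
        test, train = shuffled[:n_test], shuffled[n_test:]
        d = hist_distance(histogram(train), ref) + hist_distance(histogram(test), ref)
        if best is None or d < best[0]:
            best = (d, train, test)
    return best[1], best[2]
```

Evaluating a model on a split chosen this way avoids the luck-of-the-draw variance that a single random split introduces.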


Author(s):  
K Sooknunan ◽  
M Lochner ◽  
Bruce A Bassett ◽  
H V Peiris ◽  
R Fender ◽  
...  

Abstract With the advent of powerful telescopes such as the Square Kilometre Array and the Vera C. Rubin Observatory, we are entering an era of multiwavelength transient astronomy that will lead to a dramatic increase in data volume. Machine learning techniques are well suited to address this data challenge and rapidly classify newly detected transients. We present a multiwavelength classification algorithm consisting of three steps: (1) interpolation and augmentation of the data using Gaussian processes; (2) feature extraction using wavelets; (3) classification with random forests. Augmentation provides improved performance at test time by balancing the classes and adding diversity to the training set. In the first application of machine learning to the classification of real radio transient data, we apply our technique to the Green Bank Interferometer and other radio light curves. We find we are able to accurately classify most of the eleven classes of radio variables and transients after just eight hours of observations, achieving an overall test accuracy of 78%. We fully investigate the impact of the small sample size of 82 publicly available light curves and use data augmentation techniques to mitigate the effect. We also show that, on a significantly larger simulated representative training set, the algorithm achieves an overall accuracy of 97%, illustrating that the method is likely to provide excellent performance on future surveys. Finally, we demonstrate the effectiveness of simultaneous multiwavelength observations by showing how incorporating just one optical data point into the analysis improves the accuracy of the worst performing class by 19%.
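The class-balancing augmentation step can be sketched in simplified form. The paper draws augmented light curves from Gaussian-process fits; this stand-in merely oversamples minority classes with jittered copies, which captures the balancing effect but not the GP interpolation.

```python
import random
from collections import Counter

def balance_by_jitter(curves, labels, noise=0.05, seed=0):
    """Oversample minority classes by adding jittered copies of their
    light curves until every class matches the largest one (a crude
    stand-in for Gaussian-process-based augmentation)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_curves, out_labels = list(curves), list(labels)
    for cls, n in counts.items():
        members = [c for c, l in zip(curves, labels) if l == cls]
        for _ in range(target - n):
            src = rng.choice(members)
            out_curves.append([v + rng.gauss(0, noise) for v in src])
            out_labels.append(cls)
    return out_curves, out_labels
```

After balancing, every class contributes equally to training, which is what lifts test-time performance on rare transient classes.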


2018 ◽  
Vol 7 (2.21) ◽  
pp. 339 ◽  
Author(s):  
K Ulaga Priya ◽  
S Pushpa ◽  
K Kalaivani ◽  
A Sartiha

In the banking industry, loan processing involves the tedious task of identifying customers likely to default, and a manually misjudged application may turn into a bad loan. Banks possess huge volumes of behavioral data but are often unable to turn it into reliable predictions of loan defaulters. Modern machine learning techniques, both supervised and unsupervised, can support this analytical processing. A data model for predicting defaulting customers using the random forest technique is proposed. The model is evaluated on the training set, and based on the performance parameters, the final prediction is made on the test set. The results indicate that the random forest technique can help banks predict loan defaulters with high accuracy.


Forests ◽  
2020 ◽  
Vol 11 (3) ◽  
pp. 298 ◽  
Author(s):  
Dercilio Junior Verly Lopes ◽  
Greg W. Burgreen ◽  
Edward D. Entsminger

This technical note determines the feasibility of using an InceptionV4_ResNetV2 convolutional neural network (CNN) to correctly identify hardwood species from macroscopic images. Images were captured with a commodity smartphone fitted with a 14× macro lens. The end-grains of ten different North American hardwood species were photographed to create a dataset of 1869 images. Stratified 5-fold cross-validation was used, in which the number of test samples varied from 341 to 342. Data augmentation was performed on-the-fly for each training set by rotating, zooming, and flipping images. It was found that the CNN could correctly identify hardwood species from macroscopic end-grain images with an adjusted accuracy of 92.60%. Given the current growth of the machine learning field, this model can be readily deployed in a mobile application for field wood identification.
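The stratified 5-fold protocol used above can be sketched as index generation; this is a generic illustration (the study itself also augments each training fold on the fly), and the round-robin assignment is one simple way to preserve class proportions.

```python
from collections import defaultdict

def stratified_kfold(labels, k=5):
    """Yield (train_idx, test_idx) pairs with each class's samples
    spread evenly across the k folds."""
    by_class = defaultdict(list)
    for i, lbl in enumerate(labels):
        by_class[lbl].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        for j, i in enumerate(idxs):        # round-robin per class
            folds[j % k].append(i)
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test
```

With 1869 images and k=5, folds of 373 to 374 images would result before any per-fold filtering; the paper's test folds of 341 to 342 samples reflect its own stratification details.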


Information ◽  
2020 ◽  
Vol 11 (6) ◽  
pp. 332
Author(s):  
Ernest Kwame Ampomah ◽  
Zhiguang Qin ◽  
Gabriel Nyame

Forecasting the direction and trend of stock prices is an important task that helps investors make prudent financial decisions in the stock market. Investment in the stock market carries a big risk, and minimizing prediction error reduces that risk. Machine learning (ML) models typically perform better than statistical and econometric models, and ensemble ML models have been shown in the literature to outperform single ML models. In this work, we compare the effectiveness of tree-based ensemble ML models (Random Forest (RF), XGBoost Classifier (XG), Bagging Classifier (BC), AdaBoost Classifier (Ada), Extra Trees Classifier (ET), and Voting Classifier (VC)) in forecasting the direction of stock price movement. Eight different stock datasets from three stock exchanges (NYSE, NASDAQ, and NSE) were randomly collected and used for the study. Each dataset is split into a training and a test set. Ten-fold cross-validation accuracy is used to evaluate the ML models on the training set. In addition, the ML models are evaluated on the test set using accuracy, precision, recall, F1-score, specificity, and area under the receiver operating characteristic curve (AUC-ROC). The Kendall W test of concordance is used to rank the performance of the tree-based ML algorithms. On the training set, the AdaBoost model performed better than the rest of the models. On the test set, the accuracy, precision, F1-score, and AUC metrics produced results significant enough to rank the models, and the Extra Trees classifier outperformed the other models in all the rankings.
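The voting classifier in the comparison combines the other models' outputs; a minimal hard-voting sketch is below. It assumes each base classifier's predictions are already available as a list of labels, rather than wrapping scikit-learn's VotingClassifier.

```python
from collections import Counter

def hard_vote(predictions):
    """Majority vote across classifiers; `predictions` is a list of
    per-classifier prediction lists, one label per sample."""
    n_samples = len(predictions[0])
    voted = []
    for i in range(n_samples):
        votes = Counter(p[i] for p in predictions)
        voted.append(votes.most_common(1)[0][0])
    return voted
```

For direction-of-movement forecasting, each per-sample label would be "up" or "down", and the ensemble's prediction is whichever label most base classifiers agree on.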


Author(s):  
Tomasz Kajdanowicz ◽  
Slawomir Plamowski ◽  
Przemyslaw Kazienko

Choosing a proper training set for machine learning tasks is of great importance in complex domain problems. In this paper, a new distance measure for training set selection is presented and thoroughly discussed. The distance between two datasets is computed using the variance of entropy in groups obtained after clustering. The approach is validated using real domain datasets from a debt portfolio valuation process. Finally, prediction performance is examined.
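The "variance of entropy in groups obtained after clustering" measure can be sketched as below. The abstract does not give the exact formula, so this is one plausible reading: per-cluster label entropies are computed, and two datasets are compared by the absolute difference of their entropy variances.

```python
import math
from statistics import pvariance

def entropy(labels):
    """Shannon entropy of a label multiset, in bits."""
    n = len(labels)
    probs = [labels.count(l) / n for l in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def entropy_variance(clustered):
    """Variance of per-cluster label entropies; `clustered` maps a
    cluster id to the labels of the points assigned to it."""
    return pvariance([entropy(lbls) for lbls in clustered.values()])

def dataset_distance(clustered_a, clustered_b):
    """Distance between two clustered datasets as the absolute
    difference of their entropy variances (an assumed formulation)."""
    return abs(entropy_variance(clustered_a) - entropy_variance(clustered_b))
```

Under this reading, a candidate training set whose clusters show an entropy profile close to the target data's would be preferred for training set selection.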


2021 ◽  
Vol 8 ◽  
Author(s):  
Weipu Mao ◽  
Nieke Zhang ◽  
Keyi Wang ◽  
Qiang Hu ◽  
Si Sun ◽  
...  

We conducted a multicenter clinical study to construct a novel index based on a combination of albumin-globulin score and sarcopenia (CAS) that comprehensively reflects patients' nutritional and inflammatory status, and to assess the prognostic value of CAS in renal cell carcinoma (RCC) patients. Between 2014 and 2019, data from 443 patients who underwent nephrectomy at three centers were collected (343 in the training set and 100 in the test set). Kaplan-Meier curves were employed to analyze the impact of albumin-globulin ratio (AGR), albumin-globulin score (AGS), sarcopenia, and CAS on overall survival (OS) and cancer-specific survival (CSS) in RCC patients. Receiver operating characteristic (ROC) curves were used to assess the predictive ability of AGR, AGS, sarcopenia, and CAS. High AGR, low AGS, and nonsarcopenia were associated with higher OS and CSS. According to CAS, the training set included 60 (17.5%) patients in grade 1, 176 (51.3%) in grade 2, and 107 (31.2%) in grade 3. Lower CAS was linked to longer OS and CSS. Multivariate Cox regression analysis revealed that CAS was an independent risk factor for OS (grade 1 vs. grade 3: aHR = 0.08; 95% CI: 0.01–0.58, p = 0.012; grade 2 vs. grade 3: aHR = 0.47; 95% CI: 0.25–0.88, p = 0.018) and CSS (grade 1 vs. grade 3: aHR = 0.12; 95% CI: 0.02–0.94, p = 0.043; grade 2 vs. grade 3: aHR = 0.31; 95% CI: 0.13–0.71, p = 0.006) in RCC patients undergoing nephrectomy. Additionally, CAS had higher accuracy in predicting OS (AUC = 0.687) and CSS (AUC = 0.710) than AGR, AGS, and sarcopenia. Similar results were obtained in the test set. The novel CAS index developed in this study, which reflects patients' nutritional and inflammatory status, can better predict the prognosis of RCC patients.


2019 ◽  
Author(s):  
Maxime Thibault ◽  
Denis Lebel

Abstract The objective of this study was to determine whether it is feasible to use machine learning to evaluate how contextually appropriate a medication order is for a patient, in order to assist order review by pharmacists. A neural network was constructed using as input the sequence of word2vec embeddings of the 30 previous orders, as well as the currently active medications, pharmacological classes and ordering department, to predict the next order. The model was trained with data from 2013 to 2017, optimized using 5-fold cross-validation, and tested on orders from 2018. A survey was developed to obtain pharmacist ratings on a sample of 20 orders, which were compared with the model's predictions. The training set included 1 022 272 orders; the test set included 95 310 orders. Baseline training-set top-1, top-10 and top-30 accuracies using a dummy classifier were respectively 4.5%, 23.6% and 44.1%. Final test-set accuracies were, respectively, 44.4%, 69.9% and 80.4%. The populations in which the model performed best were obstetrics and gynecology patients and newborn babies (either in or out of neonatal intensive care). Pharmacists agreed poorly on their ratings of sampled orders, with a Fleiss kappa of 0.283. The breakdown of metrics by population showed better performance for patients following less variable order patterns, indicating potential usefulness in triaging routine orders to less extensive pharmacist review. We conclude that machine learning has potential for helping pharmacists review medication orders. Future studies should aim at evaluating the clinical benefits of using such a model in practice.
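The dummy-classifier baseline reported above amounts to always predicting the k most frequent orders seen in training. A minimal sketch, assuming orders are represented as strings:

```python
from collections import Counter

def dummy_topk_accuracy(train_orders, test_orders, k):
    """Top-k accuracy of a baseline that always predicts the k most
    frequent orders from the training data."""
    top = [o for o, _ in Counter(train_orders).most_common(k)]
    hits = sum(1 for o in test_orders if o in top)
    return hits / len(test_orders)
```

Comparing a trained model's top-k accuracy against this frequency baseline shows how much context (previous orders, active medications, department) actually contributes beyond base rates.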


2021 ◽  
Vol 13 (18) ◽  
pp. 10435
Author(s):  
Seoro Lee ◽  
Jonggun Kim ◽  
Gwanjae Lee ◽  
Jiyeong Hong ◽  
Joo Hyun Bae ◽  
...  

Changes in hydrological characteristics and increases in various pollutant loadings due to rapid climate change and urbanization have a significant impact on the deterioration of aquatic ecosystem health (AEH). Therefore, it is important to evaluate AEH effectively in advance and establish appropriate strategic plans. Recently, machine learning (ML) models have been widely used to solve hydrological and environmental problems in various fields. However, collecting sufficient data for ML training is generally time-consuming and labor-intensive, and in classification problems in particular, data imbalance can lead to erroneous predictions. In this study, we propose a method to solve the data imbalance problem through data augmentation based on a Wasserstein Generative Adversarial Network (WGAN) and to efficiently predict the grades (from A to E) of AEH indices (i.e., Benthic Macroinvertebrate Index (BMI), Trophic Diatom Index (TDI), and Fish Assessment Index (FAI)) with ML models. Raw datasets for the AEH indices, composed of various physicochemical factors (i.e., WT, DO, BOD5, SS, TN, TP, and Flow) and AEH grades, were built and augmented through the WGAN. The performance of each ML model was evaluated through 10-fold cross-validation (CV), and the models trained on the raw and WGAN-based training sets were compared through AEH grade prediction on the test sets. The ML models trained on the WGAN-based training set achieved an average F1-score of 0.9 or greater for the grades of each AEH index on the test set, outperforming the models trained only on the raw (smaller) training set. These results confirm that a WGAN-augmented dataset lets an ML model yield better AEH grade predictions than a model trained on limited data, reducing the enormous time and cost of collecting actual data from rivers. In the future, the results of this study can serve as basic data for constructing aquatic-ecosystem big data, needed to efficiently evaluate and predict AEH in rivers with ML models.
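The per-grade F1 averaging used to compare the raw and WGAN-augmented models is the standard macro F1; a self-contained sketch (the WGAN itself is far too heavy to reproduce here):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)
```

Because every grade contributes equally regardless of its frequency, macro F1 is the metric on which balancing via augmentation shows up most clearly, which is why imbalance-focused studies report it.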

