Training set formation in machine learning problems (review)

Author(s):  
Andrey Parasich ◽  
Victor Parasich ◽  
Irina Parasich

Introduction: Proper training set formation is a key factor in machine learning. In real training sets, problems and errors commonly occur and have a critical impact on the training result. A training set needs to be formed in every machine learning problem; therefore, knowledge of the possible difficulties is helpful. Purpose: To give an overview of possible problems in the formation of a training set, in order to facilitate their detection and elimination when working with real training sets, and to analyze the impact of these problems on training results. Results: The article gives an overview of possible errors in training set formation, such as lack of data, imbalance, false patterns, sampling from a limited set of sources, change in the general population over time, and others. We discuss the influence of these errors on the training result, test set formation, and the measurement of training algorithm quality. Pseudo-labeling, data augmentation, and hard sample mining are considered the most effective ways to expand a training set. We offer practical recommendations for forming a training or test set, with examples from the practice of Kaggle competitions. For the problem of cross-dataset generalization in neural network training, we propose an algorithm called Cross-Dataset Machine, which is simple to implement and yields a gain in cross-dataset generalization. Practical relevance: The materials of the article can be used as a practical guide when solving machine learning problems.
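The pseudo-labeling idea mentioned above can be sketched in a few lines. This is a minimal illustration, not the review's own algorithm: the nearest-centroid classifier, the distance-margin confidence score, and the threshold are all illustrative assumptions.

```python
import math

def centroid(points):
    """Mean of a list of equal-length feature vectors."""
    n = len(points)
    return [sum(xs) / n for xs in zip(*points)]

def nearest_centroid_predict(x, centroids):
    """Return (label, confidence); confidence is the crude margin
    between the best and second-best centroid distances."""
    dists = {lbl: math.dist(x, c) for lbl, c in centroids.items()}
    label = min(dists, key=dists.get)
    ordered = sorted(dists.values())
    confidence = ordered[1] - ordered[0] if len(ordered) > 1 else 1.0
    return label, confidence

def pseudo_label(labeled, unlabeled, threshold=1.0, rounds=3):
    """Iteratively move confident predictions from the unlabeled pool
    into the labeled set -- a minimal pseudo-labeling loop."""
    labeled = {lbl: list(pts) for lbl, pts in labeled.items()}
    pool = list(unlabeled)
    for _ in range(rounds):
        cents = {lbl: centroid(pts) for lbl, pts in labeled.items()}
        still_unlabeled = []
        for x in pool:
            lbl, conf = nearest_centroid_predict(x, cents)
            if conf >= threshold:
                labeled[lbl].append(x)   # accept the pseudo-label
            else:
                still_unlabeled.append(x)
        pool = still_unlabeled
    return labeled, pool
```

Samples that never clear the confidence threshold (here, a point equidistant from both classes) simply remain unlabeled, which is the usual safeguard against amplifying wrong pseudo-labels.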

2019 ◽  
Vol 35 (20) ◽  
pp. 3989-3995 ◽  
Author(s):  
Hongjian Li ◽  
Jiangjun Peng ◽  
Pavel Sidorov ◽  
Yee Leung ◽  
Kwong-Sak Leung ◽  
...  

Abstract Motivation Studies have shown that the accuracy of random forest (RF)-based scoring functions (SFs), such as RF-Score-v3, increases with more training samples, whereas that of classical SFs, such as X-Score, does not. Nevertheless, the impact of the similarity between training and test samples on this matter has not been studied in a systematic manner. It is therefore unclear how these SFs would perform when only trained on protein-ligand complexes that are highly dissimilar or highly similar to the test set. It is also unclear whether SFs based on machine learning algorithms other than RF can also improve accuracy with increasing training set size and to what extent they learn from dissimilar or similar training complexes. Results We present a systematic study to investigate how the accuracy of classical and machine-learning SFs varies with protein-ligand complex similarities between training and test sets. We considered three types of similarity metrics, based on the comparison of either protein structures, protein sequences or ligand structures. Regardless of the similarity metric, we found that incorporating a larger proportion of similar complexes into the training set did not make classical SFs more accurate. In contrast, RF-Score-v3 was able to outperform X-Score even when trained on just 32% of the most dissimilar complexes, showing that its superior performance owes considerably to learning from training complexes dissimilar to those in the test set. In addition, we generated the first SF employing Extreme Gradient Boosting (XGBoost), XGB-Score, and observed that it also improves with training set size while outperforming the rest of the SFs. Given the continuous growth of training datasets, the development of machine-learning SFs has become very appealing. Availability and implementation https://github.com/HongjianLi/MLSF Supplementary information Supplementary data are available at Bioinformatics online.
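The "train on the most dissimilar complexes" experiment can be sketched as follows. The study used protein-structure, protein-sequence and ligand-structure metrics; this sketch assumes only ligand fingerprints represented as bit sets and Tanimoto similarity, which is a simplification of the paper's setup.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprint bit sets."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def most_dissimilar_subset(train, test, fraction=0.32):
    """Keep the given fraction of training items least similar to the
    test set, scoring each by its max Tanimoto to any test item."""
    scored = sorted(train, key=lambda fp: max(tanimoto(fp, t) for t in test))
    k = max(1, int(len(scored) * fraction))
    return scored[:k]
```

A scoring function trained on the surviving subset can then be compared against one trained on the full set to measure how much it learns from dissimilar complexes.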


2021 ◽  
Vol 11 (5) ◽  
pp. 2039
Author(s):  
Hyunseok Shin ◽  
Sejong Oh

In machine learning applications, classification schemes have been widely used for prediction tasks. Typically, to develop a prediction model, the given dataset is divided into training and test sets; the training set is used to build the model and the test set is used to evaluate it. Furthermore, random sampling is traditionally used to divide datasets. The problem, however, is that the performance of the model is evaluated differently depending on how we divide the training and test sets. Therefore, in this study, we propose an improved sampling method for the accurate evaluation of a classification model. We first generated numerous candidate train/test splits using the R-value-based sampling method. We evaluated how closely the distribution of each candidate matched that of the whole dataset, and the candidate with the smallest distribution difference was selected as the final train/test split. Histograms and feature importance were used to evaluate the similarity of distributions. The proposed method produces better-suited training and test sets than previous sampling methods, including random and non-random sampling.
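The candidate-selection idea can be sketched as below. The paper generates candidates with R-value-based sampling and also uses feature importance; this sketch substitutes plain random candidates and a single-feature histogram distance, so it illustrates only the "pick the split whose distribution best matches the whole dataset" step.

```python
import random
from collections import Counter

def histogram(values, bins=5, lo=0.0, hi=1.0):
    """Fixed-range histogram as bin -> proportion."""
    counts = Counter(min(int((v - lo) / (hi - lo) * bins), bins - 1)
                     for v in values)
    n = len(values)
    return {b: counts.get(b, 0) / n for b in range(bins)}

def hist_distance(a, b):
    """L1 distance between two histograms over the same bins."""
    return sum(abs(a[k] - b[k]) for k in a)

def best_split(data, test_ratio=0.3, candidates=50, seed=0):
    """Among random candidate splits, return the one whose train and
    test histograms are jointly closest to the whole dataset's."""
    rng = random.Random(seed)
    ref = histogram(data)
    n_test = int(len(data) * test_ratio)
    best = None
    for _ in range(candidates):
        shuffled = data[:]
        rng.shuffle(shuffled)
        test, train = shuffled[:n_test], shuffled[n_test:]
        d = hist_distance(histogram(train), ref) + hist_distance(histogram(test), ref)
        if best is None or d < best[0]:
            best = (d, train, test)
    return best[1], best[2]
```

Evaluating a model on a split chosen this way avoids the luck-of-the-draw variance that a single random split introduces.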


Author(s):  
K Sooknunan ◽  
M Lochner ◽  
Bruce A Bassett ◽  
H V Peiris ◽  
R Fender ◽  
...  

Abstract With the advent of powerful telescopes such as the Square Kilometre Array and the Vera C. Rubin Observatory, we are entering an era of multiwavelength transient astronomy that will lead to a dramatic increase in data volume. Machine learning techniques are well suited to address this data challenge and rapidly classify newly detected transients. We present a multiwavelength classification algorithm consisting of three steps: (1) interpolation and augmentation of the data using Gaussian processes; (2) feature extraction using wavelets; (3) classification with random forests. Augmentation provides improved performance at test time by balancing the classes and adding diversity to the training set. In the first application of machine learning to the classification of real radio transient data, we apply our technique to the Green Bank Interferometer and other radio light curves. We find we are able to accurately classify most of the eleven classes of radio variables and transients after just eight hours of observations, achieving an overall test accuracy of 78%. We fully investigate the impact of the small sample size of 82 publicly available light curves and use data augmentation techniques to mitigate the effect. We also show that, on a significantly larger simulated representative training set, the algorithm achieves an overall accuracy of 97%, illustrating that the method is likely to provide excellent performance on future surveys. Finally, we demonstrate the effectiveness of simultaneous multiwavelength observations by showing how incorporating just one optical data point into the analysis improves the accuracy of the worst performing class by 19%.
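The class-balancing augmentation step can be sketched in simplified form. The paper draws augmented light curves from Gaussian-process fits; this stand-in merely oversamples minority classes with jittered copies, which captures the balancing effect but not the GP interpolation.

```python
import random
from collections import Counter

def balance_by_jitter(curves, labels, noise=0.05, seed=0):
    """Oversample minority classes by adding jittered copies of their
    light curves until every class matches the largest one (a crude
    stand-in for Gaussian-process-based augmentation)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_curves, out_labels = list(curves), list(labels)
    for cls, n in counts.items():
        members = [c for c, l in zip(curves, labels) if l == cls]
        for _ in range(target - n):
            src = rng.choice(members)
            out_curves.append([v + rng.gauss(0, noise) for v in src])
            out_labels.append(cls)
    return out_curves, out_labels
```

After balancing, every class contributes equally to training, which is what lifts test-time performance on rare transient classes.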


2018 ◽  
Vol 7 (2.21) ◽  
pp. 339 ◽  
Author(s):  
K Ulaga Priya ◽  
S Pushpa ◽  
K Kalaivani ◽  
A Sartiha

In the banking industry, loan processing involves the tedious task of identifying customers likely to default, and a manually misjudged application may turn into a bad loan. Banks possess huge volumes of behavioral data but are often unable to turn it into reliable predictions of loan defaulters. Modern machine learning techniques, both supervised and unsupervised, can support this analytical processing. A data model for predicting defaulting customers using the random forest technique is proposed. The model is evaluated on the training set, and based on the performance parameters, the final prediction is made on the test set. The results indicate that the random forest technique can help banks predict loan defaulters with high accuracy.


Forests ◽  
2020 ◽  
Vol 11 (3) ◽  
pp. 298 ◽  
Author(s):  
Dercilio Junior Verly Lopes ◽  
Greg W. Burgreen ◽  
Edward D. Entsminger

This technical note determines the feasibility of using an InceptionV4_ResNetV2 convolutional neural network (CNN) to correctly identify hardwood species from macroscopic images. Images were captured with a commodity smartphone fitted with a 14× macro lens. The end-grains of ten different North American hardwood species were photographed to create a dataset of 1869 images. Stratified 5-fold cross-validation was used, in which the number of test samples varied from 341 to 342. Data augmentation was performed on-the-fly for each training set by rotating, zooming, and flipping images. It was found that the CNN could correctly identify hardwood species from macroscopic end-grain images with an adjusted accuracy of 92.60%. Given the current growth of the machine learning field, this model can be readily deployed in a mobile application for field wood identification.
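The stratified 5-fold protocol used above can be sketched as index generation; this is a generic illustration (the study itself also augments each training fold on the fly), and the round-robin assignment is one simple way to preserve class proportions.

```python
from collections import defaultdict

def stratified_kfold(labels, k=5):
    """Yield (train_idx, test_idx) pairs with each class's samples
    spread evenly across the k folds."""
    by_class = defaultdict(list)
    for i, lbl in enumerate(labels):
        by_class[lbl].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        for j, i in enumerate(idxs):        # round-robin per class
            folds[j % k].append(i)
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test
```

With 1869 images and k=5, folds of 373 to 374 images would result before any per-fold filtering; the paper's test folds of 341 to 342 samples reflect its own stratification details.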


Information ◽  
2020 ◽  
Vol 11 (6) ◽  
pp. 332
Author(s):  
Ernest Kwame Ampomah ◽  
Zhiguang Qin ◽  
Gabriel Nyame

Forecasting the direction and trend of stock prices is an important task that helps investors make prudent financial decisions in the stock market. Investment in the stock market carries a big risk, and minimizing prediction error reduces that risk. Machine learning (ML) models typically perform better than statistical and econometric models, and ensemble ML models have been shown in the literature to outperform single ML models. In this work, we compare the effectiveness of tree-based ensemble ML models (Random Forest (RF), XGBoost Classifier (XG), Bagging Classifier (BC), AdaBoost Classifier (Ada), Extra Trees Classifier (ET), and Voting Classifier (VC)) in forecasting the direction of stock price movement. Eight different stock datasets from three stock exchanges (NYSE, NASDAQ, and NSE) were randomly collected and used for the study. Each dataset is split into a training and a test set. Ten-fold cross-validation accuracy is used to evaluate the ML models on the training set. In addition, the ML models are evaluated on the test set using accuracy, precision, recall, F1-score, specificity, and area under the receiver operating characteristic curve (AUC-ROC). The Kendall W test of concordance is used to rank the performance of the tree-based ML algorithms. On the training set, the AdaBoost model performed better than the rest of the models. On the test set, the accuracy, precision, F1-score, and AUC metrics produced results significant enough to rank the models, and the Extra Trees classifier outperformed the other models in all the rankings.
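The voting classifier in the comparison combines the other models' outputs; a minimal hard-voting sketch is below. It assumes each base classifier's predictions are already available as a list of labels, rather than wrapping scikit-learn's VotingClassifier.

```python
from collections import Counter

def hard_vote(predictions):
    """Majority vote across classifiers; `predictions` is a list of
    per-classifier prediction lists, one label per sample."""
    n_samples = len(predictions[0])
    voted = []
    for i in range(n_samples):
        votes = Counter(p[i] for p in predictions)
        voted.append(votes.most_common(1)[0][0])
    return voted
```

For direction-of-movement forecasting, each per-sample label would be "up" or "down", and the ensemble's prediction is whichever label most base classifiers agree on.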


Author(s):  
Tomasz Kajdanowicz ◽  
Slawomir Plamowski ◽  
Przemyslaw Kazienko

Choosing a proper training set for machine learning tasks is of great importance in complex domain problems. In this paper, a new distance measure for training set selection is presented and thoroughly discussed. The distance between two datasets is computed using the variance of entropy in groups obtained after clustering. The approach is validated using real domain datasets from a debt portfolio valuation process. Finally, prediction performance is examined.
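The "variance of entropy in groups obtained after clustering" measure can be sketched as below. The abstract does not give the exact formula, so this is one plausible reading: per-cluster label entropies are computed, and two datasets are compared by the absolute difference of their entropy variances.

```python
import math
from statistics import pvariance

def entropy(labels):
    """Shannon entropy of a label multiset, in bits."""
    n = len(labels)
    probs = [labels.count(l) / n for l in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def entropy_variance(clustered):
    """Variance of per-cluster label entropies; `clustered` maps a
    cluster id to the labels of the points assigned to it."""
    return pvariance([entropy(lbls) for lbls in clustered.values()])

def dataset_distance(clustered_a, clustered_b):
    """Distance between two clustered datasets as the absolute
    difference of their entropy variances (an assumed formulation)."""
    return abs(entropy_variance(clustered_a) - entropy_variance(clustered_b))
```

Under this reading, a candidate training set whose clusters show an entropy profile close to the target data's would be preferred for training set selection.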


2021 ◽  
Vol 8 ◽  
Author(s):  
Weipu Mao ◽  
Nieke Zhang ◽  
Keyi Wang ◽  
Qiang Hu ◽  
Si Sun ◽  
...  

We conducted a multicenter clinical study to construct a novel index based on a combination of albumin-globulin score and sarcopenia (CAS) that comprehensively reflects patients' nutritional and inflammatory status, and to assess the prognostic value of CAS in renal cell carcinoma (RCC) patients. Between 2014 and 2019, data from 443 patients who underwent nephrectomy at three centers were collected (343 in the training set and 100 in the test set). Kaplan-Meier curves were employed to analyze the impact of albumin-globulin ratio (AGR), albumin-globulin score (AGS), sarcopenia, and CAS on overall survival (OS) and cancer-specific survival (CSS) in RCC patients. Receiver operating characteristic (ROC) curves were used to assess the predictive ability of AGR, AGS, sarcopenia, and CAS. High AGR, low AGS, and nonsarcopenia were associated with higher OS and CSS. According to CAS, the training set included 60 (17.5%) patients in grade 1, 176 (51.3%) in grade 2, and 107 (31.2%) in grade 3. Lower CAS was linked to longer OS and CSS. Multivariate Cox regression analysis revealed that CAS was an independent risk factor for OS (grade 1 vs. grade 3: aHR = 0.08; 95% CI: 0.01–0.58, p = 0.012; grade 2 vs. grade 3: aHR = 0.47; 95% CI: 0.25–0.88, p = 0.018) and CSS (grade 1 vs. grade 3: aHR = 0.12; 95% CI: 0.02–0.94, p = 0.043; grade 2 vs. grade 3: aHR = 0.31; 95% CI: 0.13–0.71, p = 0.006) in RCC patients undergoing nephrectomy. Additionally, CAS had higher accuracy in predicting OS (AUC = 0.687) and CSS (AUC = 0.710) than AGR, AGS, and sarcopenia. Similar results were obtained in the test set. The novel CAS index developed in this study, which reflects patients' nutritional and inflammatory status, can better predict the prognosis of RCC patients.


2019 ◽  
Author(s):  
Maxime Thibault ◽  
Denis Lebel

Abstract The objective of this study was to determine whether it is feasible to use machine learning to evaluate how contextually appropriate a medication order is for a patient, in order to assist order review by pharmacists. A neural network was constructed using as input the sequence of word2vec embeddings of the 30 previous orders, as well as the currently active medications, pharmacological classes and ordering department, to predict the next order. The model was trained with data from 2013 to 2017, optimized using 5-fold cross-validation, and tested on orders from 2018. A survey was developed to obtain pharmacist ratings on a sample of 20 orders, which were compared with the model's predictions. The training set included 1 022 272 orders; the test set included 95 310 orders. Baseline training-set top-1, top-10 and top-30 accuracies using a dummy classifier were respectively 4.5%, 23.6% and 44.1%. Final test-set accuracies were, respectively, 44.4%, 69.9% and 80.4%. The populations in which the model performed best were obstetrics and gynecology patients and newborn babies (either in or out of neonatal intensive care). Pharmacists agreed poorly on their ratings of sampled orders, with a Fleiss kappa of 0.283. The breakdown of metrics by population showed better performance for patients following less variable order patterns, indicating potential usefulness in triaging routine orders to less extensive pharmacist review. We conclude that machine learning has potential for helping pharmacists review medication orders. Future studies should aim at evaluating the clinical benefits of using such a model in practice.
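The dummy-classifier baseline reported above amounts to always predicting the k most frequent orders seen in training. A minimal sketch, assuming orders are represented as strings:

```python
from collections import Counter

def dummy_topk_accuracy(train_orders, test_orders, k):
    """Top-k accuracy of a baseline that always predicts the k most
    frequent orders from the training data."""
    top = [o for o, _ in Counter(train_orders).most_common(k)]
    hits = sum(1 for o in test_orders if o in top)
    return hits / len(test_orders)
```

Comparing a trained model's top-k accuracy against this frequency baseline shows how much context (previous orders, active medications, department) actually contributes beyond base rates.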


2021 ◽  
Vol 13 (18) ◽  
pp. 10435
Author(s):  
Seoro Lee ◽  
Jonggun Kim ◽  
Gwanjae Lee ◽  
Jiyeong Hong ◽  
Joo Hyun Bae ◽  
...  

Changes in hydrological characteristics and increases in various pollutant loadings due to rapid climate change and urbanization have a significant impact on the deterioration of aquatic ecosystem health (AEH). Therefore, it is important to evaluate AEH effectively in advance and establish appropriate strategic plans. Recently, machine learning (ML) models have been widely used to solve hydrological and environmental problems in various fields. However, collecting sufficient data for ML training is generally time-consuming and labor-intensive, and in classification problems in particular, data imbalance can lead to erroneous predictions. In this study, we propose a method to solve the data imbalance problem through data augmentation based on a Wasserstein Generative Adversarial Network (WGAN) and to efficiently predict the grades (from A to E) of AEH indices (i.e., Benthic Macroinvertebrate Index (BMI), Trophic Diatom Index (TDI), and Fish Assessment Index (FAI)) with ML models. Raw datasets for the AEH indices, composed of various physicochemical factors (i.e., WT, DO, BOD5, SS, TN, TP, and Flow) and AEH grades, were built and augmented through the WGAN. The performance of each ML model was evaluated through 10-fold cross-validation (CV), and the models trained on the raw and WGAN-based training sets were compared through AEH grade prediction on the test sets. The ML models trained on the WGAN-based training set achieved an average F1-score of 0.9 or greater for the grades of each AEH index on the test set, outperforming the models trained only on the raw (smaller) training set. These results confirm that a WGAN-augmented dataset lets an ML model yield better AEH grade predictions than a model trained on limited data, reducing the enormous time and cost of collecting actual data from rivers. In the future, the results of this study can serve as basic data for constructing aquatic-ecosystem big data, needed to efficiently evaluate and predict AEH in rivers with ML models.
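The per-grade F1 averaging used to compare the raw and WGAN-augmented models is the standard macro F1; a self-contained sketch (the WGAN itself is far too heavy to reproduce here):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)
```

Because every grade contributes equally regardless of its frequency, macro F1 is the metric on which balancing via augmentation shows up most clearly, which is why imbalance-focused studies report it.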

