Machine Learning for Pre-Auction Sample Selection

Author(s):
Indika Hiran Wijesinghe, Sameera D Viswakula
2019, Vol. 11 (2), pp. 185
Author(s):
Christopher A. Ramezan, Timothy A. Warner, Aaron E. Maxwell

High spatial resolution (1–5 m) remotely sensed datasets are increasingly being used to map land cover over large geographic areas using supervised machine learning algorithms. Although many studies have compared machine learning classification methods, sample selection methods for acquiring training and validation data, and cross-validation techniques for tuning classifier parameters are rarely investigated, particularly on large, high spatial resolution datasets. This work therefore examines four sample selection methods—simple random, proportional stratified random, disproportional stratified random, and deliberative sampling—as well as three cross-validation tuning approaches: k-fold, leave-one-out, and Monte Carlo methods. In addition, it investigates the effect on accuracy of localizing sample selection to a small geographic subset of the entire area, an approach sometimes used to reduce the costs of training data collection. These methods are examined in the context of support vector machine (SVM) classification and geographic object-based image analysis (GEOBIA), using high spatial resolution National Agricultural Imagery Program (NAIP) orthoimagery and LIDAR-derived rasters covering a 2,609 km² regional-scale area in northeastern West Virginia, USA. Stratified-statistical-based sampling methods generated the highest classification accuracy. A small number of training samples collected from only a subset of the study area provided a level of overall accuracy similar to that of an equally sized sample collected in a dispersed manner across the entire regional-scale dataset. Differences in accuracy between the cross-validation tuning methods were minimal, but processing times for Monte Carlo and leave-one-out cross-validation were high, especially with large training sets; for this reason, k-fold cross-validation appears to be a good choice.
Classifiers trained with deliberately (i.e., non-randomly) collected samples were less accurate than classifiers trained on statistical-based samples. This may be due to the high positive spatial autocorrelation in the deliberative training set. Thus, if possible, training samples should be selected randomly; deliberative samples should be avoided.
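The two design choices the study recommends, proportional stratified random sampling of training data and k-fold cross-validation for tuning, can be sketched in a few lines. This is a minimal numpy-only illustration, not the authors' code; the class counts and function names are made up for the example.

```python
import numpy as np

def proportional_stratified_sample(labels, n_samples, rng):
    """Draw indices whose class proportions mirror the full label set."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    idx = []
    for c, cnt in zip(classes, counts):
        k = max(1, round(n_samples * cnt / len(labels)))  # per-class quota
        pool = np.flatnonzero(labels == c)
        idx.extend(rng.choice(pool, size=min(k, len(pool)), replace=False))
    return np.array(idx)

def k_fold_indices(n, k, rng):
    """Yield (train, test) index arrays for k-fold cross-validation."""
    perm = rng.permutation(n)
    folds = np.array_split(perm, k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, test

rng = np.random.default_rng(0)
# toy imbalanced land-cover labels: 70% / 25% / 5%
labels = np.array([0] * 700 + [1] * 250 + [2] * 50)
sample_idx = proportional_stratified_sample(labels, 100, rng)
folds = list(k_fold_indices(len(sample_idx), 5, rng))
```

The stratified sample would then feed an SVM (or any classifier), with hyperparameters tuned over the five folds.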


2020, Vol. 12 (12), pp. 231
Author(s):
Julián Miranda, Angélica Flórez, Gustavo Ospina, Ciro Gamboa, Carlos Flórez, ...

This paper presents an integrated model for seismic event detection in Colombia using machine learning techniques. Machine learning is used to identify P-wave windows in historic records and hence detect seismic events. The proposed model has five modules that group the basic procedures of a detection system: the seismic data seeking, gathering, and storage module; the seismic record reading module; the seismological station analysis module; the sample selection module; and the classification module. Each module is explained together with practical recommendations for its implementation. The resulting model clarifies how the phases required for the design and development of an offline seismic event detection system fit together.
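The five-module architecture can be sketched as a simple offline pipeline. All function names, placeholder traces, and the fixed window length below are illustrative assumptions, not details from the paper; a trained P-wave classifier would replace the stub in the final stage.

```python
def acquire_records(station_ids):
    """Module 1: seek, gather and store raw seismic data per station (stubbed)."""
    return {sid: [0.0] * 16 for sid in station_ids}  # placeholder traces

def read_records(store):
    """Module 2: parse stored records into (station, trace) pairs."""
    return list(store.items())

def analyse_stations(records):
    """Module 3: keep only stations whose traces pass a basic quality check."""
    return [(sid, tr) for sid, tr in records if len(tr) > 0]

def select_samples(records, window=8):
    """Module 4: cut fixed-length candidate P-wave windows from each trace."""
    return [(sid, tr[i:i + window])
            for sid, tr in records
            for i in range(0, len(tr) - window + 1, window)]

def classify(windows):
    """Module 5: label each window; a trained model would be plugged in here."""
    return [(sid, w, "noise") for sid, w in windows]

events = classify(select_samples(analyse_stations(
    read_records(acquire_records(["ST01", "ST02"])))))
```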


2017, Vol. 2017, pp. 1-18
Author(s):
Norbert Krautenbacher, Fabian J. Theis, Christiane Fuchs

Epidemiological studies often utilize stratified data in which rare outcomes or exposures are artificially enriched. This design can increase precision in association tests but distorts predictions when applying classifiers to nonstratified data. Several methods correct for this so-called sample selection bias, but their performance remains unclear, especially for machine learning classifiers. With an emphasis on two-phase case-control studies, we aim to assess which corrections to perform in which setting and to obtain methods suitable for machine learning techniques, especially the random forest. We propose two new resampling-based methods to resemble the original data and covariance structure: stochastic inverse-probability oversampling and parametric inverse-probability bagging. We compare all techniques for the random forest and other classifiers, both theoretically and on simulated and real data. Empirical results show that the random forest profits only from the parametric inverse-probability bagging proposed by us. For other classifiers, correction is mostly advantageous, and the methods perform uniformly. We discuss the consequences of inappropriate distribution assumptions and the reasons for the different behaviors of the random forest and other classifiers. In conclusion, we provide guidance for choosing correction methods when training classifiers on biased samples. For random forests, our method outperforms state-of-the-art procedures if the distribution assumptions are roughly fulfilled. We provide our implementation in the R package sambia.
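The inverse-probability idea underlying both proposed corrections can be illustrated with plain weighted resampling: each observation is drawn with weight proportional to the inverse of its inclusion probability, which undoes the case enrichment of a two-phase design. This numpy-only sketch is a simplification; the paper's stochastic oversampling and parametric bagging variants additionally perturb the resampled points and aggregate over bootstrap replicates. All names and the toy inclusion probabilities are illustrative.

```python
import numpy as np

def inverse_probability_resample(X, y, sampling_prob, n_out, rng):
    """Resample a stratified dataset so it resembles the source population.

    Observation i entered the study with probability sampling_prob[i];
    drawing with weights proportional to 1/sampling_prob reverses the
    enrichment of the rare class.
    """
    w = 1.0 / np.asarray(sampling_prob)
    w = w / w.sum()
    idx = rng.choice(len(y), size=n_out, replace=True, p=w)
    return X[idx], y[idx]

rng = np.random.default_rng(42)
# toy two-phase design: all cases sampled, controls sampled at 10%,
# giving a 50/50 biased sample from a ~9% prevalence population
y = np.array([1] * 100 + [0] * 100)
X = rng.normal(size=(200, 3))
p = np.where(y == 1, 1.0, 0.1)  # known inclusion probabilities
Xr, yr = inverse_probability_resample(X, y, p, 5000, rng)
```

A random forest trained on `(Xr, yr)` would then see class proportions close to the population's, rather than the artificially balanced study sample.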


DOI: 10.29007/thws, 2019
Author(s):
Lukas Hahn, Lutz Roese-Koerner, Peet Cremer, Urs Zimmermann, Ori Maoz, ...

Active Learning is concerned with the question of how to identify the most useful samples with which to train a machine learning algorithm. When applied correctly, it can be a very powerful tool to counteract the immense data requirements of artificial neural networks. However, we find that it is often applied without sufficient care and domain knowledge. As a consequence, unrealistic hopes are raised, and transferring experimental results from one dataset to another becomes unnecessarily hard. In this work we analyse the robustness of different Active Learning methods with respect to classifier capacity, exchangeability and type, as well as hyperparameters and falsely labelled data. Experiments reveal possible biases towards the architecture used for sample selection, resulting in suboptimal performance for other classifiers. We further propose the new "Sum of Squared Logits" method based on the Simpson diversity index and investigate the effect of using the confusion matrix for balancing in sample selection.
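A Simpson-index-style acquisition score can be sketched as follows. Note this is one plausible reading of the idea, applied to softmax class probabilities rather than raw logits, and is not the authors' exact "Sum of Squared Logits" formulation: the sum of squared probabilities is minimal (1/C for C classes) for a maximally ambiguous prediction and maximal (1.0) for a confident one-hot prediction, so the lowest-scoring samples are queried for labelling.

```python
import numpy as np

def sum_of_squares_score(probs):
    """Simpson-diversity-style confidence score per sample.

    Uniform predictions give the minimum (1/C); one-hot predictions
    give the maximum (1.0).
    """
    return np.square(probs).sum(axis=1)

def select_queries(probs, budget):
    """Return indices of the `budget` least-confident samples."""
    return np.argsort(sum_of_squares_score(probs))[:budget]

probs = np.array([
    [0.34, 0.33, 0.33],  # ambiguous: score close to 1/3
    [0.98, 0.01, 0.01],  # confident: score close to 1.0
    [0.60, 0.30, 0.10],  # in between
])
queries = select_queries(probs, 1)
```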


2020, Vol. 21 (1)
Author(s):
Abhigyan Nath, André Leier

Abstract Background Cytokines act by binding to specific receptors in the plasma membrane of target cells. Knowledge of cytokine–receptor interaction (CRI) is very important for understanding the pathogenesis of various human diseases—notably autoimmune, inflammatory and infectious diseases—and identifying potential therapeutic targets. Recently, machine learning algorithms have been used to predict CRIs. "Gold standard" negative datasets are still lacking, and strong biases in negative datasets can significantly affect the training of learning algorithms and their evaluation. To mitigate the unrepresentativeness and bias inherent in negative sample selection (non-interacting proteins), we propose a clustering-based approach for representative negative sample selection. Results We used deep autoencoders to investigate the effect of different sampling approaches for non-interacting pairs on the training and performance of machine learning classifiers. Using the anomaly detection capabilities of deep autoencoders, we deduced the effects of different categories of negative samples on the training of learning algorithms. Random sampling of non-interacting pairs results in either over- or under-representation of hard- or easy-to-classify instances. When K-means-based sampling of negative datasets is applied to mitigate the inadequacies of random sampling, random forest (RF) together with the combined feature set of atomic composition, physicochemical 2-grams and two different representations of evolutionary information performs best. Average model performances based on leave-one-out cross-validation (LOOCV) over ten different negative sample sets that each model was trained on show that RF models significantly outperform the previous best CRI predictor in terms of accuracy (+ 5.1%), specificity (+ 13%), MCC (+ 0.1) and g-means value (+ 5.1). Evaluations using tenfold CV and training/testing splits confirm the competitive performance.
Conclusions A comparative analysis was performed to assess the effect of three different sampling methods (random, K-means and uniform sampling) on the training of learning algorithms using different evaluation methods. Models trained on K-means-sampled datasets generally show significantly improved performance compared to those trained on random selections—with RF seemingly benefiting most in our particular setting. Our findings on sampling are highly relevant and apply to many applications of supervised learning approaches in bioinformatics.
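The K-means-based negative selection strategy can be sketched as follows: cluster the candidate non-interacting pairs in feature space, then draw representatives from every cluster so that both easy and hard regions of the negative space are covered, instead of leaving coverage to chance as plain random sampling does. This is a generic numpy-only illustration with a plain Lloyd's-algorithm K-means; the feature vectors, cluster count, and quota are illustrative, not the paper's settings.

```python
import numpy as np

def kmeans(X, k, rng, n_iter=50):
    """Plain Lloyd's algorithm; returns centroids and cluster labels."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers, labels

def kmeans_negative_sample(X_neg, n_per_cluster, k, rng):
    """Select negatives spread across k clusters of the candidate pool."""
    _, labels = kmeans(X_neg, k, rng)
    chosen = []
    for j in range(k):
        pool = np.flatnonzero(labels == j)
        take = min(n_per_cluster, len(pool))
        if take:
            chosen.extend(rng.choice(pool, size=take, replace=False))
    return np.array(chosen)

rng = np.random.default_rng(7)
# toy candidate negatives drawn from three separated feature-space regions
X_neg = np.concatenate([rng.normal(loc=m, size=(50, 4)) for m in (-3, 0, 3)])
idx = kmeans_negative_sample(X_neg, 5, 3, rng)
```

The selected indices would then be paired with the positive set to train the classifier.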


2020, Vol. 43
Author(s):
Myrthe Faber

Abstract Gilead et al. state that abstraction supports mental travel, and that mental travel critically relies on abstraction. I propose an important addition to this theoretical framework, namely that mental travel might also support abstraction. Specifically, I argue that spontaneous mental travel (mind wandering), much like data augmentation in machine learning, provides variability in mental content and context necessary for abstraction.


2020
Author(s):
Mohammed J. Zaki, Wagner Meira, Jr.
