Non-removal strategy for outliers in predictive models: The PAELLA algorithm case

2019 ◽  
Vol 28 (4) ◽  
pp. 418-429
Author(s):  
Manuel Castejón-Limas ◽  
Hector Alaiz-Moreton ◽  
Laura Fernández-Robles ◽  
Javier Alfonso-Cendón ◽  
Camino Fernández-Llamas ◽  
...  

Abstract This paper reports the experience of using the PAELLA algorithm as a helper tool in robust regression, rather than for outlier identification and removal as originally intended. This novel usage takes advantage of the occurrence vector calculated by the algorithm to strengthen the effect of the more reliable samples and lessen the impact of those that would otherwise be considered outliers. To that end, a series of experiments is conducted in order to learn how best to use the information contained in the occurrence vector. Using a deliberately difficult artificial data set, a reference predictive model is first fit on the whole raw data set. The second experiment reports the results of fitting a similar predictive model while discarding the samples marked as outliers by PAELLA. The third experiment uses the occurrence vector provided by PAELLA to classify the observations into multiple bins and fits every possible model, varying which bins are used for fitting and which are discarded. The fourth experiment introduces a sampling process before fitting, in which the occurrence vector represents the likelihood of a sample being included in the training data set. The fifth experiment performs the sampling process as an internal step interleaved between training epochs. The last experiment compares our approach, using weighted neural networks, to a state-of-the-art method.
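The two non-removal strategies described in the abstract, down-weighting samples and sampling the training set with occurrence-proportional probabilities, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the data, the occurrence values, and the use of scikit-learn's `sample_weight` are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical data: y = 2x with the first 10 samples grossly contaminated.
X = rng.uniform(0, 10, size=(200, 1))
y = 2.0 * X[:, 0] + rng.normal(0, 0.2, 200)
y[:10] += 15.0

# Hypothetical occurrence vector: high for reliable samples, low for outliers.
occurrence = np.ones(200)
occurrence[:10] = 0.05

# Strategy 1: use the occurrence vector directly as sample weights.
model_w = LinearRegression().fit(X, y, sample_weight=occurrence)

# Strategy 2: resample the training set with inclusion probability
# proportional to the occurrence vector (the spirit of the fourth experiment).
p = occurrence / occurrence.sum()
idx = rng.choice(200, size=200, replace=True, p=p)
model_s = LinearRegression().fit(X[idx], y[idx])

print(model_w.coef_[0], model_s.coef_[0])  # both slopes close to 2
```

Both variants keep every sample available to the model while letting the occurrence vector control its influence, instead of discarding outliers outright.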

2021 ◽  
Vol 263 (2) ◽  
pp. 4558-4564
Author(s):  
Minghong Zhang ◽  
Xinwei Luo

Underwater acoustic target recognition is an important aspect of underwater acoustic research. In recent years, machine learning has developed continuously and has been widely and effectively applied to underwater acoustic target recognition. Adequate data sets are essential for achieving good recognition results and reducing overfitting. However, underwater acoustic samples are relatively rare, which affects recognition accuracy. In this paper, in addition to traditional audio data augmentation methods, a new data augmentation method using a generative adversarial network is proposed, in which a generator and a discriminator learn the characteristics of underwater acoustic samples so as to generate reliable underwater acoustic signals that expand the training data set. The expanded data set is input into a deep neural network, and transfer learning is applied to further reduce the impact of small sample sizes by fixing part of the pre-trained parameters. The experimental results show that this method outperforms general underwater acoustic recognition methods, verifying its effectiveness.
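The "traditional audio data augmentation" the abstract contrasts with the GAN-based approach typically means simple waveform transforms. A minimal sketch of three such transforms, under the assumption of a synthetic one-second clip (the signal, sample rate, and parameter values are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(42)

def add_noise(signal, snr_db=20.0):
    """Inject Gaussian noise at a target signal-to-noise ratio (in dB)."""
    sig_power = np.mean(signal ** 2)
    noise_power = sig_power / (10 ** (snr_db / 10))
    return signal + rng.normal(0.0, np.sqrt(noise_power), signal.shape)

def time_shift(signal, max_shift=1000):
    """Circularly shift the waveform by a random number of samples."""
    return np.roll(signal, rng.integers(-max_shift, max_shift))

def random_gain(signal, low=0.8, high=1.2):
    """Scale the amplitude by a random factor."""
    return signal * rng.uniform(low, high)

# Augment a hypothetical one-second recording sampled at 8 kHz.
t = np.linspace(0, 1, 8000, endpoint=False)
clip = np.sin(2 * np.pi * 440 * t)
augmented = [f(clip) for f in (add_noise, time_shift, random_gain)]
```

Each transform yields a new training example of the same length, which is exactly the gap the GAN-generated signals are meant to fill more realistically.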


2020 ◽  
pp. 1-16 ◽  
Author(s):  
Mark G. Turner ◽  
Dongyang Wei ◽  
Iain Colin Prentice ◽  
Sandy P. Harrison

Abstract Most techniques for pollen-based quantitative climate reconstruction use modern assemblages as a reference data set. We examine the implications of methodological choices in the selection and treatment of the reference data set for climate reconstructions using Weighted Averaging Partial Least Squares (WA-PLS) regression and records of the last glacial period from Europe. We show that the training data set used is important because it determines the climate space sampled. The range and continuity of sampling along the climate gradient is more important than sampling density. Reconstruction uncertainties are generally reduced when more taxa are included, but combining related taxa that are poorly sampled in the data set into a higher taxonomic level provides more stable reconstructions. Excluding taxa that are climatically insensitive, or systematically overrepresented in fossil pollen assemblages because of known biases in pollen production or transport, makes no significant difference to the reconstructions. However, the exclusion of taxa overrepresented because of preservation issues does produce an improvement. These findings are relevant not only for WA-PLS reconstructions but also for similar approaches using modern assemblage reference data. There is no universal solution to these issues, but we propose a number of checks to evaluate the robustness of pollen-based reconstructions.
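The core mechanism behind WA-PLS can be illustrated with its simpler relative, plain weighted averaging: taxon climatic optima are estimated from the modern training set, and a fossil sample's climate is reconstructed as the abundance-weighted mean of those optima. The sketch below uses entirely synthetic taxa and climate values as assumptions; it is the WA baseline, not the WA-PLS regression the paper actually uses.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical modern training set: 5 taxa at 50 sites with a known
# climate variable (e.g. mean July temperature).
n_sites = 50
climate = rng.uniform(5.0, 25.0, n_sites)
optima_true = np.array([8.0, 12.0, 16.0, 20.0, 24.0])
# Each taxon's abundance peaks near its climatic optimum (Gaussian response).
abundance = np.exp(-((climate[:, None] - optima_true[None, :]) ** 2) / 20.0)
abundance /= abundance.sum(axis=1, keepdims=True)

# Step 1: estimate taxon optima as abundance-weighted means of the climate.
optima = (abundance * climate[:, None]).sum(axis=0) / abundance.sum(axis=0)

# Step 2: reconstruct climate for a fossil sample as the weighted mean of
# the estimated optima (a hypothetical sample dominated by the third taxon).
fossil = np.array([0.05, 0.10, 0.60, 0.20, 0.05])
reconstruction = (fossil * optima).sum() / fossil.sum()
print(round(reconstruction, 1))
```

This also makes the paper's point about climate-space coverage concrete: optima for taxa near the edges of the sampled gradient are pulled toward the middle, so the range of the training data directly bounds what can be reconstructed.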


2021 ◽  
Vol 2021 ◽  
pp. 1-14
Author(s):  
Ruihan Cheng ◽  
Longfei Zhang ◽  
Shiqi Wu ◽  
Sen Xu ◽  
Shang Gao ◽  
...  

Class imbalance learning (CIL) is an important branch of machine learning because, in general, it is difficult for classification models to learn from imbalanced data, while skewed data distributions frequently occur in real-world applications. In this paper, we introduce a novel CIL solution called the Probability Density Machine (PDM). First, in the context of the Gaussian Naive Bayes (GNB) predictive model, we analyze theoretically why an imbalanced data distribution degrades predictive performance, and conclude that the impact of class imbalance is associated only with the prior probability, not with the conditional probability of the training data. In this context, we then show the rationality of several traditional CIL techniques and point out the drawback of combining GNB with them. Next, borrowing the idea of K-nearest-neighbors probability density estimation (KNN-PDE), we propose the PDM, an improved GNB-based CIL algorithm. Finally, we conduct experiments on a large number of class-imbalanced data sets, and the proposed PDM algorithm shows promising results.
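The theoretical claim that imbalance enters GNB only through the prior can be demonstrated directly: fit the same conditional densities twice, once with the skewed empirical priors and once with priors forced to be uniform. The data below are a synthetic assumption for illustration; this shows the prior's role, not the PDM algorithm itself.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Imbalanced 1-D data: 950 majority vs 50 minority samples, drawn from
# equal-variance Gaussians with different means.
X = np.vstack([rng.normal(0.0, 1.0, (950, 1)),
               rng.normal(2.0, 1.0, (50, 1))])
y = np.array([0] * 950 + [1] * 50)

# Default GNB: priors estimated from the skewed class frequencies.
gnb_skewed = GaussianNB().fit(X, y)

# Identical conditional densities, but with a uniform prior: only the
# prior term changes between the two models.
gnb_uniform = GaussianNB(priors=[0.5, 0.5]).fit(X, y)

x_test = np.array([[1.0]])  # point halfway between the two class means
print(gnb_skewed.predict_proba(x_test), gnb_uniform.predict_proba(x_test))
```

At the midpoint the class-conditional likelihoods are nearly equal, so the skewed model's minority posterior collapses to roughly the minority prior, while the uniform-prior model rates both classes about equally: the degradation comes from the prior alone.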


F1000Research ◽  
2021 ◽  
Vol 10 ◽  
pp. 256
Author(s):  
Thierry Pécot ◽  
Alexander Alekseyenko ◽  
Kristin Wallace

Deep learning has revolutionized the automatic processing of images. While deep convolutional neural networks have demonstrated astonishing segmentation results for many biological objects acquired with microscopy, this good performance relies on large training data sets. In this paper, we present a strategy to minimize the time spent manually annotating images for segmentation. It involves using an efficient, open-source annotation tool; artificially enlarging the training data set with data augmentation; creating an artificial data set with a conditional generative adversarial network; and combining semantic and instance segmentation. We evaluate the impact of each of these approaches on the segmentation of nuclei in 2D widefield images of human precancerous polyp biopsies in order to define an optimal strategy.


2015 ◽  
Vol 2015 ◽  
pp. 1-9 ◽  
Author(s):  
Carl-Magnus Svensson ◽  
Ron Hübler ◽  
Marc Thilo Figge

Application of personalized medicine requires integration of different data to determine each patient’s unique clinical constitution. The automated analysis of medical data is a growing field where different machine learning techniques are used to minimize the time-consuming task of manual analysis. The evaluation, and often training, of automated classifiers requires manually labelled data as ground truth. In many cases such labelling is not perfect, either because the data are ambiguous even for a trained expert or because of mistakes. Here we investigated the interobserver variability of image data comprising fluorescently stained circulating tumor cells and its effect on the performance of two automated classifiers, a random forest and a support vector machine. We found that uncertainty in annotation between observers limited the performance of the automated classifiers, especially when it was included in the test set on which classifier performance was measured. The random forest classifier turned out to be resilient to uncertainty in the training data, while the support vector machine’s performance was highly dependent on the amount of uncertainty in the training data. Finally, we introduced the consensus data set as a possible solution for evaluating automated classifiers that minimizes the penalty of interobserver variability.
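A consensus data set of the kind described above can be built by keeping only the samples on which all observers agree. The sketch below assumes three hypothetical observers and binary cell labels; the specific construction (unanimity rather than, say, majority vote) is one plausible reading, not necessarily the paper's exact rule.

```python
import numpy as np

# Hypothetical labels from three observers for eight cells
# (1 = tumor cell, 0 = not a tumor cell).
labels = np.array([
    [1, 1, 0, 0, 1, 0, 1, 1],  # observer A
    [1, 0, 0, 0, 1, 0, 1, 1],  # observer B
    [1, 1, 0, 1, 1, 0, 0, 1],  # observer C
])

votes = labels.sum(axis=0)
n_obs = labels.shape[0]

# Consensus data set: keep only the samples all observers agree on.
unanimous = (votes == 0) | (votes == n_obs)
consensus_labels = votes[unanimous] // n_obs
print(unanimous, consensus_labels)
```

Evaluating a classifier only on the unanimous subset removes the ambiguous samples whose "ground truth" is itself uncertain, so the measured error reflects the classifier rather than observer disagreement.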


2020 ◽  
Vol 27 (1) ◽  
pp. 136-141
Author(s):  
Yohei Nishizaki ◽  
Ryoichi Horisaki ◽  
Katsuhisa Kitaguchi ◽  
Mamoru Saito ◽  
Jun Tanida

Abstract In this paper, we analyze a machine-learning-based non-iterative phase retrieval method. Phase retrieval and its applications have been attractive research topics in optics and photonics, for example in biomedical and astronomical imaging. Most conventional phase retrieval methods use iterative processes to recover phase information; however, their calculation speed and convergence are serious issues in real-time monitoring applications. Machine-learning-based methods are promising for addressing these issues. Here, we numerically compare conventional methods with a machine-learning-based method that employs a convolutional neural network. Simulations under several conditions show that the machine-learning-based method realizes fast and robust phase recovery compared with the conventional methods. We also numerically demonstrate machine-learning-based phase retrieval from noisy measurements with a noisy training data set to improve noise robustness. The machine-learning-based approach used in this study may increase the impact of phase retrieval, which serves as a fundamental tool in various fields.
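The "conventional iterative processes" the abstract refers to are typified by the classic Gerchberg-Saxton algorithm, which alternates between the object and Fourier planes, enforcing the measured amplitude in each. A minimal sketch on synthetic data (grid size, random phases, and iteration count are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical object: unit amplitude with a random phase on a 32x32 grid.
true_phase = rng.uniform(-np.pi, np.pi, (32, 32))
obj = np.exp(1j * true_phase)

# "Measurements": amplitudes in the object plane and the Fourier plane.
amp_obj = np.abs(obj)
amp_fourier = np.abs(np.fft.fft2(obj))

# Gerchberg-Saxton: alternate between planes, replacing the amplitude
# with the measured one while keeping the current phase estimate.
estimate = amp_obj * np.exp(1j * rng.uniform(-np.pi, np.pi, (32, 32)))
errors = []
for _ in range(50):
    F = np.fft.fft2(estimate)
    errors.append(np.linalg.norm(np.abs(F) - amp_fourier))
    F = amp_fourier * np.exp(1j * np.angle(F))
    estimate = amp_obj * np.exp(1j * np.angle(np.fft.ifft2(F)))
```

Every iteration requires two FFTs, and convergence can stall in local minima, which is precisely the speed and convergence problem the non-iterative CNN approach is meant to sidestep: a single forward pass replaces the whole loop.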


2021 ◽  
Author(s):  
Mohammed Almukhtar ◽  
Ameer Morad ◽  
Mustafa Albadri ◽  
MD Islam

Abstract Vision loss can occur in the severe stages of diabetic retinopathy (DR). Thus, an automatic detection method that diagnoses DR at an earlier phase may help medical doctors make better decisions. DR is considered one of the main risks leading to blindness. Computer-Aided Diagnosis (CAD) systems play an essential role in detecting features in fundus images, which may include blood vessel areas, exudates, micro-aneurysms, hemorrhages, and neovascularization. In this paper, our model combines automatic detection for diabetic retinopathy classification with localization methods based on weakly supervised learning. The model has four stages. In stage one, various preprocessing techniques are applied to smooth the data set. In stage two, the optic disk is segmented to eliminate false exudate predictions, because exudates have the same pixel color as the optic disk. In stage three, the network is trained on the data to classify each class label. Finally, the layers of the convolutional neural network are re-edited and used to localize the impact of DR on the patient's eye. The framework thus combines two essential concepts: the classification problem is addressed with supervised learning, while the localization problem is solved with a weakly supervised method. An additional layer, the weakly supervised sensitive heat map (WSSH), was added to detect the ROI of the lesion, reaching a test accuracy of 98.65%.


2019 ◽  
Vol 27 (2) ◽  
pp. 53-62
Author(s):  
Sebastian Gnat

Abstract The introduction of an ad valorem tax can lead to an increase in the tax burden on real estate. There are concerns that this increase will be large and widespread. Before undertaking any actual actions related to real estate tax reform, pilot studies and statistical analyses need to be conducted to verify the validity of those concerns and other aspects of replacing the real estate tax, agricultural tax and forest tax with an ad valorem tax. The article presents results of research on the effectiveness of classifying real estate into the group at risk of an increased tax burden using the k-nearest neighbors method. The main focus was on determining the size of the real estate set (training data set) on the basis of which classification is conducted, as well as on the efficiency of that classification depending on the size of that data set.
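The study design, k-nearest-neighbors classification evaluated at several training-set sizes, can be sketched as follows. The features, labels, sizes, and k value are all synthetic assumptions for illustration, not the article's actual cadastral data.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Hypothetical real-estate features (e.g. area, age, location score) and a
# binary label: 1 = at risk of an increased tax burden.
n = 1000
X = rng.normal(size=(n, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Hold out the last 200 properties for testing.
X_test, y_test = X[800:], y[800:]

# Measure classification efficiency as the training set grows.
for size in (50, 200, 800):
    knn = KNeighborsClassifier(n_neighbors=5).fit(X[:size], y[:size])
    acc = knn.score(X_test, y_test)
    print(size, round(acc, 3))
```

Repeating the fit over a range of training-set sizes traces out how much reference data a pilot study would need before the classification becomes reliable.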


2017 ◽  
Author(s):  
Atilla Özgür ◽  
Hamit Erdem

This study investigates the effects of using a large data set on supervised machine learning classifiers in the domain of Intrusion Detection Systems (IDS). To investigate this effect, 12 machine learning algorithms have been applied: (1) Adaboost, (2) Bayesian Nets, (3) Decision Tables, (4) Decision Trees (J48), (5) Logistic Regression, (6) Multi-Layer Perceptron, (7) Naive Bayes, (8) OneRule, (9) Random Forests, (10) Radial Basis Function Neural Networks, (11) Support Vector Machines (two different training algorithms), and (12) ZeroR. A well-known IDS benchmark data set, KDD99, has been used to train and test the classifiers. The full KDD99 training set contains 4.9 million instances, while the full test set contains 311,000 instances. In contrast to similar previous studies, which used 0.08%–10% of the data for training and 1.2%–100% for testing, this study uses the full training and test sets. The Weka Machine Learning Toolbox has been used for modeling and simulation. The performance of the classifiers has been evaluated using standard binary performance metrics: Detection Rate, True Positive Rate, True Negative Rate, False Positive Rate, False Negative Rate, Precision, and F1 score. To show the effects of data set size, the performance of the classifiers has also been evaluated using the following hardware metrics: Training Time, Working Memory, and Model Size. Test results show improvements in the classifiers' standard performance metrics compared to previous studies.
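The binary performance metrics listed above all derive from the four confusion-matrix counts. A small self-contained sketch (the counts are made-up numbers, not results from the study):

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard binary performance metrics from confusion-matrix counts."""
    tpr = tp / (tp + fn)            # true positive rate / detection rate
    tnr = tn / (tn + fp)            # true negative rate
    fpr = fp / (fp + tn)            # false positive rate
    fnr = fn / (fn + tp)            # false negative rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * tpr / (precision + tpr)
    return {"TPR": tpr, "TNR": tnr, "FPR": fpr,
            "FNR": fnr, "Precision": precision, "F1": f1}

# Illustrative counts for an IDS classifier on a labelled test set.
m = binary_metrics(tp=900, fp=50, tn=950, fn=100)
print(m)
```

In the IDS setting, FPR (benign traffic flagged as attacks) and FNR (attacks that slip through) are the operationally critical pair, which is why the study reports all four rates rather than accuracy alone.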


2016 ◽  
Vol 20 (2) ◽  
pp. 236-252 ◽  
Author(s):  
J. Patrick Biddix ◽  
Kaitlin I. Singer ◽  
Emilie Aslinger

This study examines the impact of joining a National Panhellenic Conference sorority on first-year retention. We compiled and analyzed data for first-year sorority members from 86 chapters using institutional records at sixteen 4-year colleges and universities. A total of 4,243 cases comprised the data set, which included records for 2,104 members and 2,139 nonmembers for comparison. The predictive model incorporated controls for background and institutional influences. Findings from predictive analysis showed women who joined sororities were three times more likely to return for their sophomore year. We offer recommendations for practice to strengthen the educational component of sorority membership.

