Non-removal strategy for outliers in predictive models: The PAELLA algorithm case

2019 ◽  
Vol 28 (4) ◽  
pp. 418-429
Author(s):  
Manuel Castejón-Limas ◽  
Hector Alaiz-Moreton ◽  
Laura Fernández-Robles ◽  
Javier Alfonso-Cendón ◽  
Camino Fernández-Llamas ◽  
...  

Abstract This paper reports the experience of using the PAELLA algorithm as a helper tool in robust regression, rather than for outlier identification and removal as originally intended. This novel usage takes advantage of the occurrence vector calculated by the algorithm to strengthen the effect of the more reliable samples and lessen the impact of those that would otherwise be considered outliers. To that end, a series of experiments is conducted in order to learn how best to use the information contained in the occurrence vector. Using a deliberately difficult artificial data set, a reference predictive model is first fit on the whole raw data set. The second experiment reports the results of fitting a similar predictive model while discarding the samples marked as outliers by PAELLA. The third experiment uses the occurrence vector provided by PAELLA to classify the observations into multiple bins and fits every possible model, varying which bins are used for fitting and which are discarded. The fourth experiment introduces a sampling process before fitting, in which the occurrence vector represents the likelihood of a sample being included in the training data set. The fifth experiment performs the sampling process as an internal step interleaved between training epochs. The last experiment compares our approach, using weighted neural networks, to a state-of-the-art method.
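The two non-removal strategies described in the abstract, down-weighting samples and sampling the training set with occurrence-proportional probabilities, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the data, the occurrence values, and the use of scikit-learn's `sample_weight` are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical data: y = 2x with the first 10 samples grossly contaminated.
X = rng.uniform(0, 10, size=(200, 1))
y = 2.0 * X[:, 0] + rng.normal(0, 0.2, 200)
y[:10] += 15.0

# Hypothetical occurrence vector: high for reliable samples, low for outliers.
occurrence = np.ones(200)
occurrence[:10] = 0.05

# Strategy 1: use the occurrence vector directly as sample weights.
model_w = LinearRegression().fit(X, y, sample_weight=occurrence)

# Strategy 2: resample the training set with inclusion probability
# proportional to the occurrence vector (the spirit of the fourth experiment).
p = occurrence / occurrence.sum()
idx = rng.choice(200, size=200, replace=True, p=p)
model_s = LinearRegression().fit(X[idx], y[idx])

print(model_w.coef_[0], model_s.coef_[0])  # both slopes close to 2
```

Both variants keep every sample available to the model while letting the occurrence vector control its influence, instead of discarding outliers outright.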

2021 ◽  
Vol 263 (2) ◽  
pp. 4558-4564
Author(s):  
Minghong Zhang ◽  
Xinwei Luo

Underwater acoustic target recognition is an important aspect of underwater acoustic research. In recent years, machine learning has developed continuously and has been widely and effectively applied to underwater acoustic target recognition. Adequate data sets are essential for achieving good recognition results and reducing overfitting. However, underwater acoustic samples are relatively rare, which affects recognition accuracy. In this paper, in addition to traditional audio data augmentation methods, a new data augmentation method using a generative adversarial network is proposed, in which a generator and a discriminator learn the characteristics of underwater acoustic samples so as to generate reliable underwater acoustic signals that expand the training data set. The expanded data set is input into a deep neural network, and transfer learning is applied to further reduce the impact of small sample sizes by fixing part of the pre-trained parameters. The experimental results show that this method outperforms general underwater acoustic recognition methods, verifying its effectiveness.
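The "traditional audio data augmentation" the abstract contrasts with the GAN-based approach typically means simple waveform transforms. A minimal sketch of three such transforms, under the assumption of a synthetic one-second clip (the signal, sample rate, and parameter values are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(42)

def add_noise(signal, snr_db=20.0):
    """Inject Gaussian noise at a target signal-to-noise ratio (in dB)."""
    sig_power = np.mean(signal ** 2)
    noise_power = sig_power / (10 ** (snr_db / 10))
    return signal + rng.normal(0.0, np.sqrt(noise_power), signal.shape)

def time_shift(signal, max_shift=1000):
    """Circularly shift the waveform by a random number of samples."""
    return np.roll(signal, rng.integers(-max_shift, max_shift))

def random_gain(signal, low=0.8, high=1.2):
    """Scale the amplitude by a random factor."""
    return signal * rng.uniform(low, high)

# Augment a hypothetical one-second recording sampled at 8 kHz.
t = np.linspace(0, 1, 8000, endpoint=False)
clip = np.sin(2 * np.pi * 440 * t)
augmented = [f(clip) for f in (add_noise, time_shift, random_gain)]
```

Each transform yields a new training example of the same length, which is exactly the gap the GAN-generated signals are meant to fill more realistically.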


2020 ◽  
pp. 1-16 ◽  
Author(s):  
Mark G. Turner ◽  
Dongyang Wei ◽  
Iain Colin Prentice ◽  
Sandy P. Harrison

Abstract Most techniques for pollen-based quantitative climate reconstruction use modern assemblages as a reference data set. We examine the implications of methodological choices in the selection and treatment of the reference data set for climate reconstructions using Weighted Averaging Partial Least Squares (WA-PLS) regression and records of the last glacial period from Europe. We show that the training data set used is important because it determines the climate space sampled. The range and continuity of sampling along the climate gradient is more important than sampling density. Reconstruction uncertainties are generally reduced when more taxa are included, but combining related taxa that are poorly sampled in the data set into a higher taxonomic level provides more stable reconstructions. Excluding taxa that are climatically insensitive, or systematically overrepresented in fossil pollen assemblages because of known biases in pollen production or transport, makes no significant difference to the reconstructions. However, the exclusion of taxa overrepresented because of preservation issues does produce an improvement. These findings are relevant not only for WA-PLS reconstructions but also for similar approaches using modern assemblage reference data. There is no universal solution to these issues, but we propose a number of checks to evaluate the robustness of pollen-based reconstructions.
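The core mechanism behind WA-PLS can be illustrated with its simpler relative, plain weighted averaging: taxon climatic optima are estimated from the modern training set, and a fossil sample's climate is reconstructed as the abundance-weighted mean of those optima. The sketch below uses entirely synthetic taxa and climate values as assumptions; it is the WA baseline, not the WA-PLS regression the paper actually uses.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical modern training set: 5 taxa at 50 sites with a known
# climate variable (e.g. mean July temperature).
n_sites = 50
climate = rng.uniform(5.0, 25.0, n_sites)
optima_true = np.array([8.0, 12.0, 16.0, 20.0, 24.0])
# Each taxon's abundance peaks near its climatic optimum (Gaussian response).
abundance = np.exp(-((climate[:, None] - optima_true[None, :]) ** 2) / 20.0)
abundance /= abundance.sum(axis=1, keepdims=True)

# Step 1: estimate taxon optima as abundance-weighted means of the climate.
optima = (abundance * climate[:, None]).sum(axis=0) / abundance.sum(axis=0)

# Step 2: reconstruct climate for a fossil sample as the weighted mean of
# the estimated optima (a hypothetical sample dominated by the third taxon).
fossil = np.array([0.05, 0.10, 0.60, 0.20, 0.05])
reconstruction = (fossil * optima).sum() / fossil.sum()
print(round(reconstruction, 1))
```

This also makes the paper's point about climate-space coverage concrete: optima for taxa near the edges of the sampled gradient are pulled toward the middle, so the range of the training data directly bounds what can be reconstructed.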


2021 ◽  
Vol 2021 ◽  
pp. 1-14
Author(s):  
Ruihan Cheng ◽  
Longfei Zhang ◽  
Shiqi Wu ◽  
Sen Xu ◽  
Shang Gao ◽  
...  

Class imbalance learning (CIL) is an important branch of machine learning because, in general, it is difficult for classification models to learn from imbalanced data, while skewed data distributions frequently occur in real-world applications. In this paper, we introduce a novel CIL solution called the Probability Density Machine (PDM). First, in the context of the Gaussian Naive Bayes (GNB) predictive model, we analyze theoretically why an imbalanced data distribution degrades predictive performance, and conclude that the impact of class imbalance is associated only with the prior probability, not with the conditional probability of the training data. In this context, we then show the rationality of several traditional CIL techniques and point out the drawback of combining GNB with them. Next, borrowing the idea of K-nearest-neighbors probability density estimation (KNN-PDE), we propose the PDM, an improved GNB-based CIL algorithm. Finally, we conduct experiments on a large number of class-imbalanced data sets, and the proposed PDM algorithm shows promising results.
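The theoretical claim that imbalance enters GNB only through the prior can be demonstrated directly: fit the same conditional densities twice, once with the skewed empirical priors and once with priors forced to be uniform. The data below are a synthetic assumption for illustration; this shows the prior's role, not the PDM algorithm itself.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Imbalanced 1-D data: 950 majority vs 50 minority samples, drawn from
# equal-variance Gaussians with different means.
X = np.vstack([rng.normal(0.0, 1.0, (950, 1)),
               rng.normal(2.0, 1.0, (50, 1))])
y = np.array([0] * 950 + [1] * 50)

# Default GNB: priors estimated from the skewed class frequencies.
gnb_skewed = GaussianNB().fit(X, y)

# Identical conditional densities, but with a uniform prior: only the
# prior term changes between the two models.
gnb_uniform = GaussianNB(priors=[0.5, 0.5]).fit(X, y)

x_test = np.array([[1.0]])  # point halfway between the two class means
print(gnb_skewed.predict_proba(x_test), gnb_uniform.predict_proba(x_test))
```

At the midpoint the class-conditional likelihoods are nearly equal, so the skewed model's minority posterior collapses to roughly the minority prior, while the uniform-prior model rates both classes about equally: the degradation comes from the prior alone.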


F1000Research ◽  
2021 ◽  
Vol 10 ◽  
pp. 256
Author(s):  
Thierry Pécot ◽  
Alexander Alekseyenko ◽  
Kristin Wallace

Deep learning has revolutionized the automatic processing of images. While deep convolutional neural networks have demonstrated astonishing segmentation results for many biological objects acquired with microscopy, this good performance relies on large training data sets. In this paper, we present a strategy to minimize the time spent manually annotating images for segmentation. It involves using an efficient, open-source annotation tool; artificially enlarging the training data set with data augmentation; creating an artificial data set with a conditional generative adversarial network; and combining semantic and instance segmentation. We evaluate the impact of each of these approaches on the segmentation of nuclei in 2D widefield images of human precancerous polyp biopsies in order to define an optimal strategy.


2015 ◽  
Vol 2015 ◽  
pp. 1-9 ◽  
Author(s):  
Carl-Magnus Svensson ◽  
Ron Hübler ◽  
Marc Thilo Figge

Application of personalized medicine requires integration of different data to determine each patient’s unique clinical constitution. The automated analysis of medical data is a growing field where different machine learning techniques are used to minimize the time-consuming task of manual analysis. The evaluation, and often training, of automated classifiers requires manually labelled data as ground truth. In many cases such labelling is not perfect, either because the data are ambiguous even for a trained expert or because of mistakes. Here we investigated the interobserver variability of image data comprising fluorescently stained circulating tumor cells and its effect on the performance of two automated classifiers, a random forest and a support vector machine. We found that uncertainty in annotation between observers limited the performance of the automated classifiers, especially when it was included in the test set on which classifier performance was measured. The random forest classifier turned out to be resilient to uncertainty in the training data, while the support vector machine’s performance was highly dependent on the amount of uncertainty in the training data. Finally, we introduced the consensus data set as a possible solution for evaluating automated classifiers that minimizes the penalty of interobserver variability.
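A consensus data set of the kind described above can be built by keeping only the samples on which all observers agree. The sketch below assumes three hypothetical observers and binary cell labels; the specific construction (unanimity rather than, say, majority vote) is one plausible reading, not necessarily the paper's exact rule.

```python
import numpy as np

# Hypothetical labels from three observers for eight cells
# (1 = tumor cell, 0 = not a tumor cell).
labels = np.array([
    [1, 1, 0, 0, 1, 0, 1, 1],  # observer A
    [1, 0, 0, 0, 1, 0, 1, 1],  # observer B
    [1, 1, 0, 1, 1, 0, 0, 1],  # observer C
])

votes = labels.sum(axis=0)
n_obs = labels.shape[0]

# Consensus data set: keep only the samples all observers agree on.
unanimous = (votes == 0) | (votes == n_obs)
consensus_labels = votes[unanimous] // n_obs
print(unanimous, consensus_labels)
```

Evaluating a classifier only on the unanimous subset removes the ambiguous samples whose "ground truth" is itself uncertain, so the measured error reflects the classifier rather than observer disagreement.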


2020 ◽  
Vol 27 (1) ◽  
pp. 136-141
Author(s):  
Yohei Nishizaki ◽  
Ryoichi Horisaki ◽  
Katsuhisa Kitaguchi ◽  
Mamoru Saito ◽  
Jun Tanida

Abstract In this paper, we analyze a machine-learning-based non-iterative phase retrieval method. Phase retrieval and its applications have been attractive research topics in optics and photonics, for example in biomedical and astronomical imaging. Most conventional phase retrieval methods use iterative processes to recover phase information; however, their calculation speed and convergence are serious issues in real-time monitoring applications. Machine-learning-based methods are promising for addressing these issues. Here, we numerically compare conventional methods with a machine-learning-based method that employs a convolutional neural network. Simulations under several conditions show that the machine-learning-based method realizes fast and robust phase recovery compared with the conventional methods. We also numerically demonstrate machine-learning-based phase retrieval from noisy measurements with a noisy training data set to improve noise robustness. The machine-learning-based approach used in this study may increase the impact of phase retrieval, which serves as a fundamental tool in various fields.
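The "conventional iterative processes" the abstract refers to are typified by the classic Gerchberg-Saxton algorithm, which alternates between the object and Fourier planes, enforcing the measured amplitude in each. A minimal sketch on synthetic data (grid size, random phases, and iteration count are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical object: unit amplitude with a random phase on a 32x32 grid.
true_phase = rng.uniform(-np.pi, np.pi, (32, 32))
obj = np.exp(1j * true_phase)

# "Measurements": amplitudes in the object plane and the Fourier plane.
amp_obj = np.abs(obj)
amp_fourier = np.abs(np.fft.fft2(obj))

# Gerchberg-Saxton: alternate between planes, replacing the amplitude
# with the measured one while keeping the current phase estimate.
estimate = amp_obj * np.exp(1j * rng.uniform(-np.pi, np.pi, (32, 32)))
errors = []
for _ in range(50):
    F = np.fft.fft2(estimate)
    errors.append(np.linalg.norm(np.abs(F) - amp_fourier))
    F = amp_fourier * np.exp(1j * np.angle(F))
    estimate = amp_obj * np.exp(1j * np.angle(np.fft.ifft2(F)))
```

Every iteration requires two FFTs, and convergence can stall in local minima, which is precisely the speed and convergence problem the non-iterative CNN approach is meant to sidestep: a single forward pass replaces the whole loop.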


2021 ◽  
Author(s):  
Mohammed Almukhtar ◽  
Ameer Morad ◽  
Mustafa Albadri ◽  
MD Islam

Abstract Vision loss can occur in the severe stages of diabetic retinopathy (DR). Thus, an automatic detection method that diagnoses DR at an earlier phase may help medical doctors make better decisions. DR is considered one of the main risks leading to blindness. Computer-Aided Diagnosis (CAD) systems play an essential role in detecting features in fundus images, which may include blood vessel areas, exudates, micro-aneurysms, hemorrhages, and neovascularization. In this paper, our model combines automatic detection for diabetic retinopathy classification with localization methods based on weakly supervised learning. The model has four stages. In stage one, various preprocessing techniques are applied to smooth the data set. In stage two, the optic disk is segmented to eliminate false exudate predictions, because exudates have the same pixel color as the optic disk. In stage three, the network is trained on the data to classify each class label. Finally, the layers of the convolutional neural network are re-edited and used to localize the impact of DR on the patient's eye. The framework thus combines two essential concepts: the classification problem is addressed with supervised learning, while the localization problem is solved with a weakly supervised method. An additional layer, the weakly supervised sensitive heat map (WSSH), was added to detect the ROI of the lesion, reaching a test accuracy of 98.65%.


2019 ◽  
Vol 27 (2) ◽  
pp. 53-62
Author(s):  
Sebastian Gnat

Abstract The introduction of an ad valorem tax can lead to an increase in the tax burden on real estate. There are concerns that this increase will be large and widespread. Before undertaking any actual actions related to real estate tax reform, pilot studies and statistical analyses need to be conducted to verify the validity of those concerns and other aspects of replacing the real estate tax, agricultural tax and forest tax with an ad valorem tax. The article presents results of research on the effectiveness of classifying real estate into the group at risk of an increased tax burden using the k-nearest neighbors method. The main focus was on determining the size of the real estate set (training data set) on the basis of which classification is conducted, as well as on the efficiency of that classification depending on the size of that data set.
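The study design, k-nearest-neighbors classification evaluated at several training-set sizes, can be sketched as follows. The features, labels, sizes, and k value are all synthetic assumptions for illustration, not the article's actual cadastral data.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Hypothetical real-estate features (e.g. area, age, location score) and a
# binary label: 1 = at risk of an increased tax burden.
n = 1000
X = rng.normal(size=(n, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Hold out the last 200 properties for testing.
X_test, y_test = X[800:], y[800:]

# Measure classification efficiency as the training set grows.
for size in (50, 200, 800):
    knn = KNeighborsClassifier(n_neighbors=5).fit(X[:size], y[:size])
    acc = knn.score(X_test, y_test)
    print(size, round(acc, 3))
```

Repeating the fit over a range of training-set sizes traces out how much reference data a pilot study would need before the classification becomes reliable.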


2017 ◽  
Author(s):  
Atilla Özgür ◽  
Hamit Erdem

This study investigates the effects of using a large data set on supervised machine learning classifiers in the domain of Intrusion Detection Systems (IDS). To investigate this effect, 12 machine learning algorithms have been applied: (1) Adaboost, (2) Bayesian Nets, (3) Decision Tables, (4) Decision Trees (J48), (5) Logistic Regression, (6) Multi-Layer Perceptron, (7) Naive Bayes, (8) OneRule, (9) Random Forests, (10) Radial Basis Function Neural Networks, (11) Support Vector Machines (two different training algorithms), and (12) ZeroR. A well-known IDS benchmark data set, KDD99, has been used to train and test the classifiers. The full KDD99 training set contains 4.9 million instances, while the full test set contains 311,000 instances. In contrast to similar previous studies, which used 0.08%–10% of the data for training and 1.2%–100% for testing, this study uses the full training and test sets. The Weka Machine Learning Toolbox has been used for modeling and simulation. The performance of the classifiers has been evaluated using standard binary performance metrics: Detection Rate, True Positive Rate, True Negative Rate, False Positive Rate, False Negative Rate, Precision, and F1 score. To show the effects of data set size, the performance of the classifiers has also been evaluated using the following hardware metrics: Training Time, Working Memory, and Model Size. Test results show improvements in the classifiers' standard performance metrics compared to previous studies.
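The binary performance metrics listed above all derive from the four confusion-matrix counts. A small self-contained sketch (the counts are made-up numbers, not results from the study):

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard binary performance metrics from confusion-matrix counts."""
    tpr = tp / (tp + fn)            # true positive rate / detection rate
    tnr = tn / (tn + fp)            # true negative rate
    fpr = fp / (fp + tn)            # false positive rate
    fnr = fn / (fn + tp)            # false negative rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * tpr / (precision + tpr)
    return {"TPR": tpr, "TNR": tnr, "FPR": fpr,
            "FNR": fnr, "Precision": precision, "F1": f1}

# Illustrative counts for an IDS classifier on a labelled test set.
m = binary_metrics(tp=900, fp=50, tn=950, fn=100)
print(m)
```

In the IDS setting, FPR (benign traffic flagged as attacks) and FNR (attacks that slip through) are the operationally critical pair, which is why the study reports all four rates rather than accuracy alone.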


2016 ◽  
Vol 20 (2) ◽  
pp. 236-252 ◽  
Author(s):  
J. Patrick Biddix ◽  
Kaitlin I. Singer ◽  
Emilie Aslinger

This study examines the impact of joining a National Panhellenic Conference sorority on first-year retention. We compiled and analyzed data for first-year sorority members from 86 chapters using institutional records at sixteen 4-year colleges and universities. A total of 4,243 cases comprised the data set, which included records for 2,104 members and 2,139 nonmembers for comparison. The predictive model incorporated controls for background and institutional influences. Findings from predictive analysis showed women who joined sororities were three times more likely to return for their sophomore year. We offer recommendations for practice to strengthen the educational component of sorority membership.

