resampling method

Author(s):  
Yang Huang ◽  
Duen-Ren Liu ◽  
Shin-Jye Lee ◽  
Chia-Hao Hsu ◽  
Yang-Guang Liu

Author(s):  
Christian Bruch

In this paper, we propose a method that estimates the variance of an imputed estimator in a multistage sampling design. The method is based on the rescaling bootstrap for multistage sampling introduced by Preston (Surv Methodol 35(2):227–234, 2009). In its original version, this resampling method requires that the dataset includes only complete cases and no missing values. Thus, we propose two modifications for applying this method to nonresponse and imputation. These modifications are compared to other modifications in a Monte Carlo simulation study. The results of our simulation study show that our two proposed approaches are superior to the other modifications of the rescaling bootstrap and, in many situations, produce valid estimators for the variance of the imputed estimator in multistage sampling designs.
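As a rough illustration of the rescaling idea, the sketch below applies a without-replacement rescaling bootstrap to a single-stage simple random sample with complete cases. It is not the multistage procedure of Preston (2009), and it does not include the imputation modifications proposed in the paper. The function name, the simulated data, and the choice of half-sample size `m = n // 2` are assumptions made for illustration.

```python
import math
import random


def rescaling_bootstrap_variance(y, weights, N, B=2000, seed=0):
    """Bootstrap variance of the weighted total sum(w_i * y_i), single stage only."""
    rng = random.Random(seed)
    n = len(y)
    m = n // 2                                # subsample size, drawn without replacement
    f = n / N                                 # sampling fraction
    lam = math.sqrt(m * (1 - f) / (n - m))    # rescaling factor
    totals = []
    for _ in range(B):
        chosen = set(rng.sample(range(n), m))
        total = 0.0
        for i in range(n):
            delta = 1.0 if i in chosen else 0.0
            # Rescaled weight: selected units are inflated, the others deflated.
            w_star = weights[i] * (1 - lam + lam * (n / m) * delta)
            total += w_star * y[i]
        totals.append(total)
    mean_total = sum(totals) / B
    return sum((t - mean_total) ** 2 for t in totals) / (B - 1)


# Hypothetical complete-case sample of n = 50 units from a population of N = 1000.
data_rng = random.Random(1)
N, n = 1000, 50
y = [data_rng.gauss(10, 2) for _ in range(n)]
weights = [N / n] * n

v_boot = rescaling_bootstrap_variance(y, weights, N)

# Textbook variance estimator of the total under simple random sampling, for comparison.
y_bar = sum(y) / n
s2 = sum((v - y_bar) ** 2 for v in y) / (n - 1)
v_analytic = N ** 2 * (1 - n / N) * s2 / n

print(v_boot, v_analytic)
```

For a linear estimator such as a total, the rescaling factor is chosen so that the bootstrap variance agrees in expectation with the analytic estimator, which is why the two printed values should be close.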


Foods ◽  
2021 ◽  
Vol 10 (10) ◽  
pp. 2472
Author(s):  
Shogo Okamoto

In the last decade, temporal dominance of sensations (TDS) methods have proven to be potent approaches in the field of food sciences. Accordingly, methods have been developed for analyzing TDS curves, which are the major outputs of TDS methods. This study proposes a method of bootstrap resampling for TDS tasks. The proposed method enables the production of random TDS curves to estimate the uncertainties, that is, the 95% confidence interval and standard error of the curves. Based on Monte Carlo simulation studies, the estimated uncertainties are considered valid and match those estimated by approximated normal distributions when the number of independent TDS tasks or samples is 50–100 or greater. The proposed resampling method enables researchers to apply statistical analyses and machine-learning approaches that require a large sample size of TDS curves.
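The following is a minimal sketch of a generic percentile bootstrap over TDS tasks, not necessarily the procedure used in the paper. It assumes each task is recorded as a list giving the dominant attribute at each time point. The attribute names, the simulated data, and the function names are hypothetical.

```python
import random
import statistics


def dominance_curve(trials, attribute):
    """Proportion of trials in which `attribute` is dominant at each time point."""
    n = len(trials)
    n_times = len(trials[0])
    return [sum(trial[t] == attribute for trial in trials) / n for t in range(n_times)]


def bootstrap_curve(trials, attribute, B=1000, seed=0):
    """Percentile 95% interval and standard error of the dominance curve."""
    rng = random.Random(seed)
    n = len(trials)
    replicates = []
    for _ in range(B):
        # Resample whole tasks with replacement, then recompute the curve.
        resampled = [trials[rng.randrange(n)] for _ in range(n)]
        replicates.append(dominance_curve(resampled, attribute))
    lower, upper, se = [], [], []
    for t in range(len(trials[0])):
        values = sorted(rep[t] for rep in replicates)
        lower.append(values[int(0.025 * (B - 1))])
        upper.append(values[int(0.975 * (B - 1))])
        se.append(statistics.stdev(values))
    return lower, upper, se


# Hypothetical data: 60 tasks, 5 time points, 3 attributes.
data_rng = random.Random(42)
attributes = ["sweet", "sour", "bitter"]
trials = [[data_rng.choice(attributes) for _ in range(5)] for _ in range(60)]

curve = dominance_curve(trials, "sweet")
lower, upper, se = bootstrap_curve(trials, "sweet")
print(curve)
print(lower, upper, se)
```

Resampling whole tasks, rather than individual time points, keeps the temporal structure of each task intact in every bootstrap replicate.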


2021 ◽  
Vol 9 ◽  
Author(s):  
Daniel Lowell Weller ◽  
Tanzy M. T. Love ◽  
Martin Wiedmann

Recent studies have shown that predictive models can supplement or provide alternatives to E. coli-testing for assessing the potential presence of food safety hazards in water used for produce production. However, these studies used balanced training data and focused on enteric pathogens. As such, research is needed to determine 1) if predictive models can be used to assess Listeria contamination of agricultural water, and 2) how resampling (to deal with imbalanced data) affects performance of these models. To address these knowledge gaps, this study developed models that predict nonpathogenic Listeria spp. (excluding L. monocytogenes) and L. monocytogenes presence in agricultural water using various combinations of learner (e.g., random forest, regression), feature type, and resampling method (none, oversampling, SMOTE). Four feature types were used in model training: microbial, physicochemical, spatial, and weather. “Full models” were trained using all four feature types, while “nested models” used between one and three types. In total, 45 full (15 learners*3 resampling approaches) and 108 nested (5 learners*9 feature sets*3 resampling approaches) models were trained per outcome. Model performance was compared against baseline models where E. coli concentration was the sole predictor. Overall, the machine learning models outperformed the baseline E. coli models, with random forests outperforming models built using other learners (e.g., rule-based learners). Resampling produced more accurate models than not resampling, with SMOTE models outperforming, on average, oversampling models. Regardless of resampling method, spatial and physicochemical water quality features drove accurate predictions for the nonpathogenic Listeria spp. and L. monocytogenes models, respectively. Overall, these findings 1) illustrate the need for alternatives to existing E. coli-based monitoring programs for assessing agricultural water for the presence of potential food safety hazards, and 2) suggest that predictive models may be one such alternative. Moreover, these findings provide a conceptual framework for how such models can be developed in the future with the ultimate aim of developing models that can be integrated into on-farm risk management programs. For example, future studies should consider using random forest learners, SMOTE resampling, and spatial features to develop models to predict the presence of foodborne pathogens, such as L. monocytogenes, in agricultural water when the training data is imbalanced.
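To make the difference between plain oversampling and SMOTE concrete, here is a minimal pure-Python sketch of the core SMOTE step: each synthetic minority sample is interpolated between a minority point and one of its nearest minority neighbours. It is an illustration only, not the pipeline used in the study. The feature values, the function name `smote_sketch`, and the neighbour count `k=3` are assumptions.

```python
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base_idx = rng.randrange(len(minority))
        base = minority[base_idx]
        others = [p for j, p in enumerate(minority) if j != base_idx]
        # k nearest minority neighbours of the base point (squared Euclidean distance).
        neighbours = sorted(
            others,
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Hypothetical two-feature minority class (for example, positive samples).
minority = [(1.0, 2.0), (1.5, 1.8), (1.2, 2.4), (0.9, 2.1), (1.3, 2.2)]
n_majority = 20

# Add enough synthetic samples to match the majority class size.
synthetic = smote_sketch(minority, n_new=n_majority - len(minority))
balanced_minority = minority + synthetic
print(len(balanced_minority))
```

Plain random oversampling would instead duplicate existing minority rows. SMOTE's interpolated points avoid exact duplicates, which is one commonly cited reason it can generalize better.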


2021 ◽  
Author(s):  
Jo-Anne Bright ◽  
Duncan Alexander Taylor ◽  
James Michael Curran ◽  
John Buckleton

Two methods for applying a lower bound to the variation induced by the Monte Carlo effect are trialled. One of these is implemented in the widely used probabilistic genotyping system, STRmix. Neither approach gives the desired 99% coverage, and in some cases the coverage is much lower. The discrepancy (i.e. the distance between the LR corresponding to the desired coverage and the LR observed coverage at 99%) is not large. For example, the discrepancy of 0.23 for approach 1 suggests the lower bounds should be moved downwards by a factor of 1.7 to achieve the desired 99% coverage. Although less effective than desired, these methods provide a layer of conservatism that is additional to the other layers. These other layers come from factors such as the conservatism within the sub-population model, the choice of conservative measures of co-ancestry, the consideration of relatives within the population, and the resampling method used for allele probabilities, all of which tend to understate the strength of the findings.
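The link between the discrepancy of 0.23 and the factor of 1.7 is consistent with the discrepancy being measured on a log10(LR) scale. The abstract does not state the scale, so that reading is an assumption. Under it, the arithmetic is:

```python
# Assumption: the discrepancy is a difference in log10(LR), which the abstract does not state.
discrepancy_log10 = 0.23

# A shift of d on the log10 scale corresponds to a multiplicative factor of 10**d.
factor = 10 ** discrepancy_log10

print(round(factor, 1))  # 1.7
```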


2021 ◽  
Vol 20 (Number 3) ◽  
pp. 423-456
Author(s):  
Adil Yaseen Taha ◽  
Sabrina Tiun ◽  
Abdul Hadi Abd Rahman ◽  
Ali Sabah

Simultaneous multiple labelling of documents, also known as multilabel text classification, will not perform optimally if the classes are highly imbalanced. Class imbalance entails skewness in the underlying data distribution, which makes classification more difficult. Random over-sampling and under-sampling are common approaches to solving the class imbalance problem. However, these approaches have several drawbacks: under-sampling is likely to discard useful data, whereas over-sampling can heighten the probability of overfitting. Therefore, a new method that can avoid both discarding useful data and overfitting is needed. This study proposes a method to tackle the class imbalance problem by combining multilabel over-sampling and under-sampling with class alignment (ML-OUSCA). Instead of using all the training instances, the proposed ML-OUSCA draws a new training set by over-sampling small classes and under-sampling large classes. To evaluate the proposed ML-OUSCA, average precision, average recall, and average F-measure were computed on three benchmark datasets, namely Reuters-21578, Bibtex, and Enron. Experimental results showed that the proposed ML-OUSCA outperformed the chosen baseline random resampling approaches, K-means SMOTE and KNN-US. Thus, based on the results, we can conclude that designing a resampling method based on class imbalance together with class alignment will improve multilabel classification even more than the random resampling method alone.
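The idea of drawing a new training set by over-sampling small classes and under-sampling large ones can be sketched as follows. This is a simplified single-label illustration with purely random selection. It does not reproduce ML-OUSCA, which handles multilabel data and adds class alignment. The function name, the `target` size, and the toy documents are hypothetical.

```python
import random
from collections import Counter, defaultdict


def resample_to_target(samples, labels, target, seed=0):
    """Bring every class to `target` instances by random over- or under-sampling."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    new_samples, new_labels = [], []
    for label, items in by_class.items():
        if len(items) >= target:
            # Large class: keep a random subset, drawn without replacement.
            chosen = rng.sample(items, target)
        else:
            # Small class: keep everything and add random duplicates.
            extra = [rng.choice(items) for _ in range(target - len(items))]
            chosen = items + extra
        new_samples.extend(chosen)
        new_labels.extend([label] * target)
    return new_samples, new_labels


# Hypothetical imbalanced corpus: 12 "earn" documents, 3 "grain", 1 "ship".
labels = ["earn"] * 12 + ["grain"] * 3 + ["ship"] * 1
samples = [f"doc_{i}" for i in range(len(labels))]

new_samples, new_labels = resample_to_target(samples, labels, target=5)
class_counts = Counter(new_labels)
print(class_counts)
```

Choosing a target between the smallest and largest class sizes limits both the amount of data discarded from large classes and the number of duplicates added to small ones.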

