Evaluating Sampling Methods for Reusing Knowledge From Large and Ill-Structured Qualitative Data Sets

Author(s):  
Jacob Nelson ◽  
G. Austin Marrs ◽  
Greg Schmidt ◽  
Joseph A. Donndelinger ◽  
Robert L. Nagel

The desire to use ever-growing qualitative data sets of user-generated content in the engineering design process in a computationally effective manner makes it increasingly necessary to draw representative samples. This work investigated the ability of alternative sampling algorithms to draw samples that conform to characteristics of the original data set. The sampling methods investigated included random sampling, interval sampling, fixed-increment (or systematic) sampling, and stratified sampling. Data collected through the Vehicle Owner’s Questionnaire, a survey administered by the U.S. National Highway Traffic Safety Administration, is used as a case study throughout this paper. The paper demonstrates that existing statistical methods may be used to evaluate goodness of fit for samples drawn from large bodies of qualitative data. Evaluation of goodness of fit not only provides confidence that a sample is representative of the data set from which it is drawn, but also yields valuable real-time feedback during the sampling process. This investigation revealed two interesting and counterintuitive trends in sampling algorithm performance. The first is that larger sample sizes do not necessarily lead to improved goodness of fit. The second is that, depending on the details of implementation, data cleansing may degrade the performance of data sampling algorithms rather than improve it. This work illustrates the importance of aligning sampling procedures with data structures and of validating the conformance of samples to characteristics of the larger data set, so as to avoid drawing erroneous conclusions from unexpectedly biased samples of data.
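
The evaluation loop described above can be sketched in a few lines. The snippet below is an illustration rather than the authors' implementation; the column name model_year, the sample size, and the use of a chi-square goodness-of-fit test on category counts are all assumptions made for the example.

```python
# Illustrative sketch: draw random and stratified samples and test each
# sample's conformance to the full data set's category distribution with a
# chi-square goodness-of-fit test. "model_year" is a hypothetical
# stratification variable standing in for a questionnaire field.
import numpy as np
import pandas as pd
from scipy.stats import chisquare

def random_sample(df, n, seed=0):
    return df.sample(n=n, random_state=seed)

def stratified_sample(df, n, column, seed=0):
    # Proportional allocation: sample the same fraction from every stratum.
    return df.groupby(column, group_keys=False).sample(frac=n / len(df), random_state=seed)

def conformance(sample, population, column):
    # Compare observed category counts in the sample against the counts
    # expected from the population proportions.
    props = population[column].value_counts(normalize=True)
    observed = sample[column].value_counts().reindex(props.index, fill_value=0)
    expected = props * len(sample)
    return chisquare(f_obs=observed, f_exp=expected)

# Toy population standing in for the Vehicle Owner's Questionnaire records.
rng = np.random.default_rng(0)
population = pd.DataFrame({"model_year": rng.choice([2005, 2006, 2007, 2008],
                                                    size=10_000,
                                                    p=[0.4, 0.3, 0.2, 0.1])})
for name, sample in [("random", random_sample(population, 500)),
                     ("stratified", stratified_sample(population, 500, "model_year"))]:
    stat, p_value = conformance(sample, population, "model_year")
    print(f"{name}: chi2 = {stat:.2f}, p = {p_value:.3f}")
```

A large p-value here only indicates no detectable departure from the population proportions; run while samples are being drawn, the same check provides the kind of real-time feedback the abstract describes.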

Author(s):  
L Mohana Tirumala ◽  
S. Srinivasa Rao

Privacy preservation in data mining and publishing plays a major role in today's networked world. It is important to preserve the privacy of the vital information contained in a data set. This can be achieved with a k-anonymization solution for classification. Along with preserving privacy through anonymization, yielding optimized data sets in a cost-effective manner is of equal importance. In this paper a Top-Down Refinement algorithm is proposed which yields optimal results in a cost-effective manner. Bayesian classification is also proposed to predict class membership probabilities for a data tuple whose associated class label is unknown.
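
For orientation, the sketch below is not the proposed Top-Down Refinement algorithm; it only checks whether a table already satisfies k-anonymity over a chosen set of quasi-identifier columns, with hypothetical column names. Top-Down Refinement starts from fully generalized quasi-identifiers and selectively specializes them while a check of this kind (together with classification utility) continues to hold.

```python
# Minimal k-anonymity check: every combination of quasi-identifier values
# must be shared by at least k records; otherwise those records are
# vulnerable to re-identification.
import pandas as pd

def is_k_anonymous(df: pd.DataFrame, quasi_identifiers: list, k: int) -> bool:
    group_sizes = df.groupby(quasi_identifiers).size()
    return bool((group_sizes >= k).all())

records = pd.DataFrame({
    "age_band":   ["30-39", "30-39", "30-39", "40-49", "40-49"],
    "zip_prefix": ["112**", "112**", "112**", "113**", "113**"],
    "diagnosis":  ["flu", "cold", "flu", "flu", "cold"],  # sensitive attribute
})
print(is_k_anonymous(records, ["age_band", "zip_prefix"], k=2))  # True
```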


2016 ◽  
Vol 72 (6) ◽  
pp. 696-703 ◽  
Author(s):  
Julian Henn

An alternative measure to the goodness of fit (GoF) is developed and applied to experimental data. The alternative goodness of fit squared (aGoFs) demonstrates that the GoF regularly fails to provide evidence for the presence of systematic errors, because certain requirements are not met. These requirements are briefly discussed. It is shown that in many experimental data sets a correlation exists between the squared residuals and the variance of observed intensities. These correlations corrupt the GoF and lead to artificially reduced values in the GoF and in the numerical value of the wR(F2). Remaining systematic errors in the data sets are veiled by this mechanism. In data sets where these correlations do not appear for the entire data set, they often appear for the decile of largest variances of observed intensities. Additionally, statistical errors for the squared goodness of fit, GoFs, and the aGoFs are developed and applied to experimental data. This measure shows how significantly the GoFs and aGoFs deviate from the ideal value of one.
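
For context, the conventional quantities that the aGoFs is set against are the standard crystallographic definitions below (textbook notation, not formulas reproduced from the paper), where n is the number of observations, p the number of refined parameters, w the weights, and F_o^2, F_c^2 the observed and calculated squared structure-factor amplitudes; with correctly estimated variances the expected value of the GoF is one.

```latex
% Standard goodness of fit and weighted R factor on F^2.
\mathrm{GoF} = \sqrt{\frac{\sum_i w_i \left(F_{o,i}^{2} - F_{c,i}^{2}\right)^{2}}{n - p}},
\qquad
wR(F^{2}) = \sqrt{\frac{\sum_i w_i \left(F_{o,i}^{2} - F_{c,i}^{2}\right)^{2}}{\sum_i w_i \left(F_{o,i}^{2}\right)^{2}}}
```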


Author(s):  
J. DIEBOLT ◽  
M.-A. EL-AROUI ◽  
V. DURBEC ◽  
B. VILLAIN

When extreme quantiles have to be estimated from a given data set, the classical parametric approach can lead to very poor estimates. This has led to the introduction of specific methods for estimating extreme quantiles (MEEQs) in a nonparametric spirit, e.g., Pickands' excess method, methods based on Hill's estimate of the Pareto index, and the exponential tail (ET) and quadratic tail (QT) methods. However, no practical technique is available for assessing and comparing these MEEQs when they are to be used on a given data set. This paper is a first attempt to provide such techniques. We first compare the estimates given by the main MEEQs on several simulated data sets. Then we suggest goodness-of-fit (Gof) tests to assess the MEEQs by measuring the quality of their underlying approximations. It is shown that Gof techniques provide very relevant tools for assessing and comparing the ET and excess methods. Other empirical criteria for comparing MEEQs are also proposed and studied through Monte Carlo analyses. Finally, these assessment and comparison techniques are applied to real data sets from an industrial context where extreme quantiles are needed to define maintenance policies.
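
As an illustration of one of the MEEQs discussed, the sketch below implements the exponential tail (ET) estimator in its usual peaks-over-threshold form and attaches a simple goodness-of-fit check of the exponential approximation to the threshold excesses. The threshold choice and the use of a Kolmogorov-Smirnov test are simplifying assumptions for the example, not the assessment procedure developed in the paper.

```python
# Exponential tail (ET) extreme-quantile estimation: model excesses over a
# high threshold u as exponential, then extrapolate the tail to probability p.
import numpy as np
from scipy import stats

def et_quantile(x, p, threshold_quantile=0.9):
    """Estimate the p-quantile (p close to 1) of the underlying distribution."""
    x = np.asarray(x)
    n = x.size
    u = np.quantile(x, threshold_quantile)   # high threshold (simplifying choice)
    excesses = x[x > u] - u
    k = excesses.size
    sigma = excesses.mean()                  # MLE of the exponential scale
    # Tail approximation: P(X > u + y) ~ (k/n) * exp(-y / sigma)
    q = u + sigma * np.log(k / (n * (1.0 - p)))
    # Crude goodness-of-fit check of the exponential approximation.
    gof = stats.kstest(excesses, "expon", args=(0.0, sigma))
    return q, gof.pvalue

rng = np.random.default_rng(1)
sample = rng.lognormal(mean=0.0, sigma=1.0, size=5000)
q999, p_value = et_quantile(sample, p=0.999)
print(f"ET estimate of the 0.999 quantile: {q999:.2f} (KS p-value {p_value:.2f})")
print(f"True lognormal 0.999 quantile:     {stats.lognorm.ppf(0.999, s=1.0):.2f}")
```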


2001 ◽  
Vol 67 (5) ◽  
pp. 2129-2135 ◽  
Author(s):  
Lihui Zhao ◽  
Yuhuan Chen ◽  
Donald W. Schaffner

Percentage is widely used to describe different results in food microbiology, e.g., probability of microbial growth, percent inactivated, and percent of positive samples. Four sets of percentage data, percent-growth-positive, germination extent, probability for one cell to grow, and maximum fraction of positive tubes, were obtained from our own experiments and from the literature. These data were modeled using linear and logistic regression. Five methods were used to compare the goodness of fit of the two models: percentage of predictions closer to observations, range of the differences (predicted value minus observed value), deviation of the model, linear regression between the observed and predicted values, and bias and accuracy factors. Logistic regression was a better predictor of at least 78% of the observations in all four data sets. In all cases, the deviation of the logistic models was much smaller. The linear correlation between observations and logistic predictions was always stronger. Validation (accomplished using part of one data set) also demonstrated that the logistic model was more accurate in predicting new data points. Bias and accuracy factors were found to be less informative when evaluating models developed for percentage data, since neither of these indices can compare predictions at zero. Model simplification for the logistic model was demonstrated with one data set. The simplified model was as powerful in making predictions as the full linear model, and it also gave clearer insight in determining the key experimental factors.
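
The model comparison can be sketched as follows, using synthetic data rather than the growth data analysed in the paper; the logistic form, the noise level, and the comparison by share of closer predictions are assumptions made for the illustration.

```python
# Fit linear and logistic models to percentage data and compare them by the
# share of observations each model predicts more closely and by residual range.
import numpy as np
from scipy.optimize import curve_fit

def logistic(x, a, b):
    return 1.0 / (1.0 + np.exp(-(a + b * x)))

rng = np.random.default_rng(2)
x = np.linspace(0.0, 30.0, 40)                                   # e.g., incubation time
p_true = logistic(x, -5.0, 0.4)                                  # underlying growth probability
y = np.clip(p_true + rng.normal(0.0, 0.05, x.size), 0.0, 1.0)    # observed fractions

y_lin = np.polyval(np.polyfit(x, y, 1), x)                       # linear model
params, _ = curve_fit(logistic, x, y, p0=[0.0, 0.1])             # logistic model
y_log = logistic(x, *params)

closer = np.mean(np.abs(y - y_log) < np.abs(y - y_lin))
print(f"logistic closer for {100 * closer:.0f}% of observations")
print(f"residual range: linear {np.ptp(y - y_lin):.3f}, logistic {np.ptp(y - y_log):.3f}")
```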


2021 ◽  
Vol 14 (11) ◽  
pp. 2519-2532
Author(s):  
Fatemeh Nargesian ◽  
Abolfazl Asudeh ◽  
H. V. Jagadish

Data scientists often develop data sets for analysis by drawing upon sources of data available to them. A major challenge is to ensure that the data set used for analysis has an appropriate representation of relevant (demographic) groups, that is, that it meets desired distribution requirements. Whether data are collected through an experiment or obtained from a data provider, the data from any single source may not meet the desired distribution requirements; a union of data from multiple sources is therefore often required. In this paper, we study how to acquire such data in the most cost-effective manner, for typical cost functions observed in practice. We present an optimal solution for binary groups when the underlying distributions of the data sources are known and all data sources have equal costs. For the generic case with unequal costs, we design an approximation algorithm that performs well in practice. When the underlying distributions are unknown, we develop an exploration-exploitation strategy with a reward function that captures the cost and the approximation of group distributions in each data source. Besides theoretical analysis, we conduct comprehensive experiments that confirm the effectiveness of our algorithms.
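
To make the problem setting concrete, the sketch below simulates the binary-group case with known source distributions and equal per-record costs using a simple greedy rule: query the source whose next record is most likely to fill a still-unmet group quota. This is only an illustration of the acquisition problem, not the optimal algorithm or the exploration-exploitation strategy presented in the paper.

```python
# Greedy illustration of cost-aware data acquisition for two demographic groups.
import random

def acquire(sources, need, seed=0):
    """sources: P(group == 1) for each source; need: {0: n0, 1: n1} target counts.
    Returns the number of records queried (equal cost per record)."""
    rng = random.Random(seed)
    need = dict(need)
    cost = 0
    while any(n > 0 for n in need.values()):
        def value(p):  # probability that a record from this source is still needed
            return (p if need[1] > 0 else 0.0) + ((1.0 - p) if need[0] > 0 else 0.0)
        p = max(sources, key=value)
        group = 1 if rng.random() < p else 0
        if need[group] > 0:
            need[group] -= 1
        cost += 1
    return cost

# Three sources with different group-1 proportions; target 100 records per group.
print(acquire(sources=[0.9, 0.5, 0.1], need={0: 100, 1: 100}))
```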


2011 ◽  
Vol 271-273 ◽  
pp. 1291-1296
Author(s):  
Jin Wei Zhang ◽  
Hui Juan Lu ◽  
Wu Tao Chen ◽  
Yi Lu

The classifier built from a highly skewed class distribution data set generally predicts an unknown sample as the majority class much more frequently than as the minority class. This is because the classifier is designed to maximize overall classification accuracy. We compare three classification methods for data sets in which the class distribution is imbalanced and the misclassification costs are non-uniform: a cost-sensitive learning method whose misclassification cost is embedded in the algorithm, an over-sampling method, and an under-sampling method. In this paper, we compare these three methods to determine which one produces the best overall classification under various circumstances. We draw the following conclusions: 1. Cost-sensitive learning is suitable for the classification of imbalanced data sets. It outperforms the sampling methods overall and is more stable than the sampling methods, except when the data set is quite small. 2. If the data set is highly skewed or quite small, over-sampling methods may be better.
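
The contrast between the first two approaches can be sketched with scikit-learn on synthetic imbalanced data; the class-weight ratio, the classifier, and the evaluation metric below are assumptions made for the illustration, not the paper's experimental setup.

```python
# Cost-sensitive learning vs. over-sampling on an imbalanced synthetic data set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# 1) Cost-sensitive: misclassification costs enter as class weights.
cost_sensitive = LogisticRegression(class_weight={0: 1, 1: 19}, max_iter=1000).fit(X_tr, y_tr)

# 2) Over-sampling: replicate minority-class records until the classes balance.
minority = y_tr == 1
X_up, y_up = resample(X_tr[minority], y_tr[minority],
                      n_samples=int((~minority).sum()), random_state=0)
X_bal = np.vstack([X_tr[~minority], X_up])
y_bal = np.concatenate([y_tr[~minority], y_up])
over_sampled = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)

for name, clf in [("cost-sensitive", cost_sensitive), ("over-sampling", over_sampled)]:
    print(name, "minority-class F1:", round(f1_score(y_te, clf.predict(X_te)), 3))
```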


2018 ◽  
Vol 35 (10) ◽  
pp. 2094-2118 ◽  
Author(s):  
Sourish Sarkar ◽  
Balaji Rajagopalan

Purpose: The purpose of this paper is to investigate the value of information in consumer safety complaints for organizational learning.
Design/methodology/approach: The empirical analysis uses a novel secondary data set formed by combining complaints filed with the National Highway Traffic Safety Administration (NHTSA) for potential safety defects with design change information for 2003 to 2011 model-year vehicles in the USA.
Findings: First, the paper demonstrates the value of information embedded in complaints. Second, in the case of radical product redesigns, owing to the lack of direct applicability of consumer-feedback-based learning, the impact of learning on product safety is found to be muted. Third, the results suggest that safety complaint rates vary by vehicle class/category. Fourth, the findings differ from prior research conclusions on vehicle quality: prior research finds that debuting car models have the lowest repair rates among all car models produced in a given year, but the current study finds the debuting models to have the highest rates of safety complaints.
Originality/value: The quality management literature rarely examines safety complaints data (which, unlike other consumer feedback, focuses exclusively on safety hazards due to flaws that result in accidents). This paper fills that gap by linking safety complaints with future product quality and organizational learning.


2020 ◽  
Vol 17 (15) ◽  
pp. 4043-4057
Author(s):  
Hua W. Xie ◽  
Adriana L. Romero-Olivares ◽  
Michele Guindani ◽  
Steven D. Allison

To make predictions about the carbon cycling consequences of rising global surface temperatures, Earth system scientists rely on mathematical soil biogeochemical models (SBMs). However, it is not clear which models have better predictive accuracy, and a rigorous quantitative approach for comparing and validating the predictions has yet to be established. In this study, we present a Bayesian approach to SBM comparison that can be incorporated into a statistical model selection framework. We compared the fits of linear and nonlinear SBMs to soil respiration data compiled in a recent meta-analysis of soil warming field experiments. Fit quality was quantified using Bayesian goodness-of-fit metrics, including the widely applicable information criterion (WAIC) and leave-one-out cross-validation (LOO). We found that the linear model generally outperformed the nonlinear model at fitting the meta-analysis data set. Both WAIC and LOO computed higher overfitting risk and effective numbers of parameters for the nonlinear model compared to the linear model, conditional on the data set. Goodness of fit for both models generally improved when they were initialized with lower and more realistic steady-state soil organic carbon densities. Still, testing whether linear models offer definitively superior predictive performance over nonlinear models on a global scale will require comparisons with additional site-specific data sets of suitable size and dimensionality. Such comparisons can build upon the approach defined in this study to make more rigorous statistical determinations about model accuracy while leveraging emerging data sets, such as those from long-term ecological research experiments.
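
The comparison workflow can be sketched with PyMC and ArviZ on synthetic data; the two toy models, priors, and data below are stand-ins chosen for the example and are not the soil biogeochemical models compared in the paper.

```python
# Compare a linear and a (saturating) nonlinear model with WAIC and LOO.
import numpy as np
import pymc as pm
import arviz as az

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 60)                     # stand-in predictor
y = 2.0 + 0.8 * x + rng.normal(0.0, 1.0, x.size)   # synthetic observations

def fit(build_model):
    with build_model():
        return pm.sample(1000, tune=1000, chains=2, random_seed=0, progressbar=False,
                         idata_kwargs={"log_likelihood": True})

def linear_model():
    m = pm.Model()
    with m:
        a, b = pm.Normal("a", 0, 10), pm.Normal("b", 0, 10)
        s = pm.HalfNormal("s", 5)
        pm.Normal("obs", mu=a + b * x, sigma=s, observed=y)
    return m

def nonlinear_model():
    m = pm.Model()
    with m:
        a, b = pm.Normal("a", 0, 10), pm.Normal("b", 0, 10)
        k = pm.HalfNormal("k", 5)
        s = pm.HalfNormal("s", 5)
        pm.Normal("obs", mu=a + b * x / (k + x), sigma=s, observed=y)
    return m

runs = {"linear": fit(linear_model), "nonlinear": fit(nonlinear_model)}
print(az.waic(runs["linear"]))            # widely applicable information criterion
print(az.compare(runs, ic="loo"))         # ranks models by estimated predictive accuracy
```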


1988 ◽  
Vol 21 (1) ◽  
pp. 22-28 ◽  
Author(s):  
J. K. Maichle ◽  
J. Ihringer ◽  
W. Prandl

A technique has been developed for the simultaneous analysis of several powder diffraction data sets on the basis of the Rietveld method. Counting rates from one specimen at a given temperature, taken on neutron, synchrotron or X-ray powder diffractometers, are joined into a single data set with weights given by the counting statistics. The structure is refined from this data set with a parameter field containing one structural model and individual zero points, scale factors and FWHM parameters for each of the methods and data sets. A new definition of the residuals is given. The residuals and goodness-of-fit values are calculated for all data sets combined as well as for the individual data sets.
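
For orientation, the weighted quantities that such a joint refinement builds on are the familiar Rietveld expressions below, with counting-statistics weights w_i = 1/σ_i²; the paper's own redefined residuals for the combined and individual data sets are not reproduced here.

```latex
% Weighted profile residual and goodness of fit for n profile points and p
% refined parameters; y_i are observed and calculated counts.
R_{wp} = \sqrt{\frac{\sum_i w_i \left(y_i^{\mathrm{obs}} - y_i^{\mathrm{calc}}\right)^{2}}
                    {\sum_i w_i \left(y_i^{\mathrm{obs}}\right)^{2}}},
\qquad
\mathrm{GoF} = \sqrt{\frac{\sum_i w_i \left(y_i^{\mathrm{obs}} - y_i^{\mathrm{calc}}\right)^{2}}{n - p}}
```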


Fractals ◽  
2006 ◽  
Vol 14 (02) ◽  
pp. 143-148 ◽  
Author(s):  
H. MILLÁN ◽  
M. AGUILAR ◽  
J. DOMÍNGUEZ ◽  
L. CÉSPEDES ◽  
E. VELASCO ◽  
...  

Fractals are important for studying the physics of water transport in soils. Many authors have assumed a mass fractal structure, while others consider a fractal surface approach. Each model needs to be compared on the same data set in terms of goodness of fit and the physical interpretation of its parameters. In this note, it is shown, with some representative data sets, that a pore-solid interface fractal model can fit soil water retention data better than a mass fractal model. In addition to the interfacial fractal dimension, this model predicts the tension at dryness. This value is very close to 10⁶ kPa, as theoretically predicted.

