Missing by Design: Patterns for Optimizing Survey Response by Efficient and Consistent Data Collection

2020
Author(s): Sara Bahrami

Respondent burden due to long questionnaires in surveys can negatively affect the response rate as well as the quality of responses. A solution to this problem is the split questionnaire design (SQD). In an SQD, the items of the long questionnaire are divided into subsets, and only a fraction of the item subsets is assigned to each random subsample of individuals. This leads to several shorter questionnaires, each administered to a random subsample of individuals. The completed sub-questionnaires are then combined, and the values missing by design are imputed by means of multiple imputation. Identification problems can be avoided in advance by ensuring that the combinations of variables in the analysis model of interest are jointly observed on at least a subsample of individuals. Furthermore, including an appropriate combination of items in each sub-questionnaire is the most important concern in designing an SQD so as to reduce the information loss; i.e., highly correlated items that explain each other well should not be jointly missing. For this reason, training data must be available from previous surveys or a pilot study to exploit the association between the variables. In this thesis, two SQDs are proposed. In the first study, a potential design for NEPS data is introduced. The data consist of items that can be divided and allocated into blocks according to their context, with the objective that the within-block correlations are higher than the between-block correlations. According to the design, the target sample is divided into subsamples. Each subsample is assigned all items of one whole block, plus a fraction of the items of the remaining blocks drawn at random, where items belonging to blocks with relatively higher correlations are drawn with lower probability. The design is evaluated by means of several ex-post investigations: it is imposed on complete data, and several models are estimated on both the complete data and the data deleted by design. The design is also compared with a random multiple matrix sampling (MMS) design, which assigns a random subset of items to each sampled individual. In the second study, a genetic algorithm is used to search among a vast number of SQDs for the optimal design. The algorithm evaluates the designs by the fraction of missing information (FMI) induced by the design; the optimal design is the one with the smallest FMI. The optimal design is evaluated by means of several simulation studies and is compared with a random MMS design.
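
As a rough illustration of the block-based assignment described above, the following Python sketch gives each subsample one full block plus a random draw from the other blocks, drawing items from highly correlated blocks with lower probability. The block labels, correlations, and draw sizes are invented for the example and are not taken from the NEPS design:

```python
# A minimal sketch of block-based SQD item assignment (assumed details).
import numpy as np

rng = np.random.default_rng(0)

blocks = {"A": list(range(0, 10)), "B": list(range(10, 20)), "C": list(range(20, 30))}
# Hypothetical mean between-block correlations (from pilot data in practice).
between_corr = {("A", "B"): 0.6, ("A", "C"): 0.2, ("B", "C"): 0.4}

def corr(b1, b2):
    return between_corr.get((b1, b2)) or between_corr.get((b2, b1))

def assign_items(core_block, n_extra=5):
    """Items for the subsample whose fully observed core block is `core_block`."""
    items = list(blocks[core_block])  # the whole core block
    for other, other_items in blocks.items():
        if other == core_block:
            continue
        # Items from blocks highly correlated with the observed core block can
        # be imputed from it later, so they are drawn with lower probability.
        p = 1.0 - corr(core_block, other)
        k = rng.binomial(n_extra, p)
        items += list(rng.choice(other_items, size=k, replace=False))
    return sorted(items)

for b in blocks:
    print(b, assign_items(b))
```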

2021
Author(s): Hansle Gwon, Imjin Ahn, Yunha Kim, Hee Jun Kang, Hyeram Seo, ...

BACKGROUND: When machine learning is used in the real world, the missing value problem is the first problem encountered. Methods to impute missing values include statistical methods, such as mean imputation, expectation-maximization, and multiple imputation by chained equations (MICE), as well as machine learning methods, such as the multilayer perceptron, k-nearest neighbors, and decision trees.

OBJECTIVE: The objective of this study was to impute numeric medical data such as physical data and laboratory data. We aimed to impute data effectively using a progressive method called self-training in the medical field, where training data are scarce.

METHODS: We propose a self-training method that gradually increases the available data. Models trained on complete data predict the missing values in incomplete data. Among the incomplete data, the records whose missing values are validly predicted are incorporated into the complete data. Using the predicted value as if it were the actual value is called pseudolabeling. This process is repeated until a stopping condition is satisfied. The most important part of this process is how to evaluate the accuracy of the pseudolabels; they can be evaluated by observing the effect of the pseudolabeled data on the performance of the model.

RESULTS: In self-training using random forest (RF), the mean squared error was up to 12% lower than with pure RF, and the Pearson correlation coefficient was 0.1% higher. This difference was confirmed statistically. In the Friedman test comparing self-training with MICE and RF, self-training showed P values between .003 and .02. A Wilcoxon signed-rank test against mean imputation yielded the lowest possible P value, 3.05 × 10−5, in all situations.

CONCLUSIONS: Self-training showed significant results when comparing predicted and actual values, but it still needs to be verified in an actual machine learning system. Self-training also has the potential to improve performance depending on the pseudolabel evaluation method, which will be the main subject of our future research.
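
The following Python sketch illustrates one plausible reading of the self-training loop. It is not the authors' code; the per-tree-variance confidence check and its threshold are assumptions standing in for the paper's pseudolabel evaluation:

```python
# A minimal self-training imputation sketch: a model trained on complete rows
# predicts one missing column; confidently pseudolabeled rows are promoted to
# the training set, and the loop repeats. Assumes only `y_col` has missing values.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def self_train_impute(X, y_col, max_rounds=10, threshold=0.5):
    """X: 2-D float array with np.nan in column `y_col` for incomplete rows."""
    X = X.copy()
    for _ in range(max_rounds):
        missing = np.isnan(X[:, y_col])
        if not missing.any():
            break
        features = np.delete(X, y_col, axis=1)
        model = RandomForestRegressor(n_estimators=100, random_state=0)
        model.fit(features[~missing], X[~missing, y_col])
        preds = model.predict(features[missing])
        # Confidence proxy (an assumption): spread of per-tree predictions.
        per_tree = np.stack([t.predict(features[missing]) for t in model.estimators_])
        confident = per_tree.std(axis=0) < threshold
        if not confident.any():
            break
        idx = np.where(missing)[0][confident]
        X[idx, y_col] = preds[confident]  # pseudolabel the confident rows
    return X
```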


Genes
2021
Vol 12 (5)
pp. 708
Author(s): Moran Gershoni, Joel Ira Weller, Ephraim Ezra

Yearling weight gain in male and female Israeli Holstein calves, defined as 365 × ((weight − 35)/age at weight) + 35, was analyzed from 814,729 records on 368,255 animals from 740 herds recorded between 1994 and 2021. The variance components were calculated based on valid records from 2008 through 2017 for each sex separately and for both sexes jointly by a single-trait individual animal model analysis, which accounted for repeat records on animals. The analysis model also included the square root, linear, and quadratic effects of age at weight. Heritability and repeatability were 0.35 and 0.71 in the analysis of both sexes and were similar in the single-sex analyses. The regression of yearling weight gain on birth date in the complete data set was −0.96 kg/year. The complete data set was also analyzed by the same model as the variance component analysis, including both sexes and accounting for differing variance components for each sex. The genetic trend for yearling weight gain, including both sexes, was 1.02 kg/year. Genetic evaluations for yearling weight gain were positively correlated with genetic evaluations for milk, fat, and protein production and for cow survival, but negatively correlated with female fertility. Yearling weight gain was also correlated with the direct effect on dystocia: increased yearling weight gain resulted in a greater frequency of dystocia. Of the 1749 Israeli Holstein bulls genotyped with reliabilities >50%, 1445 had genetic evaluations. As genotyping of these bulls was performed using several single nucleotide polymorphism (SNP) chip platforms, we included only those markers that were genotyped in >90% of the tested cohort; a total of 40,498 SNPs were retained. More than 400 markers had significant effects after permutation and correction for multiple testing (nominal p < 1 × 10−8). Considering all SNPs simultaneously, 0.69 of the variance among the sires' transmitting abilities was explained. There were 24 markers with coefficients of determination for yearling weight gain >0.04. One marker, BTA-75458-no-rs on chromosome 5, explained ≈6% of the variance among the estimated breeding values for yearling weight gain. ARS-BFGL-NGS-39379 had the fifth-largest coefficient of determination in the current study and was also found to have a significant effect on weight at an age of 13–14 months in a previous study on Holsteins. Significant genomic effects on yearling weight gain were mainly associated with milk production quantitative trait loci, specifically with kappa casein metabolism.
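
The trait definition above translates directly into code; here the constant 35 appears to play the role of an assumed birth weight in kg:

```python
# Direct transcription of the yearling weight gain formula quoted above.
def yearling_weight_gain(weight_kg: float, age_days: float) -> float:
    """365 * ((weight - 35) / age at weight) + 35."""
    return 365 * ((weight_kg - 35) / age_days) + 35

# e.g., a 300 kg calf weighed at 280 days of age:
print(yearling_weight_gain(300, 280))  # ≈ 380.4 kg
```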


Equity
2018
pp. 152-188
Author(s): Irit Samet

This chapter examines the clean hands doctrine, according to which a claimant who knocks on the court's door with a hand tainted by illegality or immorality will not be given her day in court. Instead of listening to her potentially successful claim, Equity resorts to its characteristic ad hominem, flexible, morally sensitive, and ex post approach to drive her away. The doctrine, despite its effect on a vast number of disputes, is under-theorised and fraught with a lack of clarity. The chapter first considers the sources of the considerable legal anxiety caused by the way this powerful gatekeeper operates, before discussing the underpinnings of the three traditional justifications for keeping the clean hands gatekeeper in place: coherence, deterrence, and integrity. I then argue that by interpreting the integrity justification as a concept rooted in moral psychology, we can understand how these incommensurable justifications relate to each other and operate in judicial reasoning. I conclude by showing how the clean hands gatekeeper creates a 'dirty hands' type of dilemma for judges, and what they can do about it.


2015
Vol 2015
pp. 1-14
Author(s): Jaemun Sim, Jonathan Sangyun Lee, Ohbyung Kwon

In a ubiquitous environment, high-accuracy data analysis is essential because it affects real-world decision-making. In practice, however, user-related data from information systems are often missing due to users' concerns about privacy or their lack of obligation to provide complete data. This incompleteness can impair the accuracy of data analysis using classification algorithms and thereby degrade the value of the data. Many studies have attempted to overcome these incompleteness issues and to improve the quality of analysis with classification algorithms. The performance of a classification algorithm may be affected by the characteristics and patterns of the missing data, such as the ratio of missing to complete data. We perform a concrete causal analysis of the differences in the performance of classification algorithms across various factors, examining the characteristics of the missing values, the datasets, and the imputation methods. We also propose imputation and classification algorithms appropriate to different datasets and circumstances.
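
A minimal sketch of the kind of factorial experiment described: vary the missing-data ratio, impute with different methods, and compare classifier accuracy. The dataset, ratios, imputers, and MCAR missingness mechanism here are illustrative assumptions, not the paper's setup:

```python
# Compare imputation methods across missing-data ratios on a toy dataset.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

for ratio in (0.1, 0.3, 0.5):                    # missing-data ratio
    X_miss = X.copy()
    X_miss[rng.random(X.shape) < ratio] = np.nan  # MCAR missingness
    for name, imputer in [("mean", SimpleImputer(strategy="mean")),
                          ("kNN", KNNImputer(n_neighbors=5))]:
        X_imp = imputer.fit_transform(X_miss)
        acc = cross_val_score(RandomForestClassifier(random_state=0),
                              X_imp, y, cv=5).mean()
        print(f"ratio={ratio:.1f} imputer={name}: accuracy={acc:.3f}")
```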


Author(s): Yang Zhao, Jianyi Zhang, Changyou Chen

Scalable Bayesian sampling plays an important role in modern machine learning, especially in fast-developing unsupervised (deep) learning models. While tremendous progress has been achieved with scalable Bayesian samplers such as stochastic gradient MCMC (SG-MCMC) and Stein variational gradient descent (SVGD), the generated samples are typically highly correlated, and the sample-generation process is often criticized as inefficient. In this paper, we propose a novel self-adversarial learning framework that automatically learns a conditional generator to mimic the behavior of a Markov (transition) kernel. High-quality samples can be generated efficiently by direct forward passes through the learned generator. Most importantly, the learning process adopts a self-learning paradigm that requires no information about existing Markov kernels, e.g., knowledge of how to draw samples from them. Specifically, our framework learns to use current samples, either from the generator or from pre-provided training data, to update the generator so that the generated samples progressively approach the target distribution; hence the name self-learning. Experiments on both synthetic and real datasets verify the advantages of our framework, which outperforms related methods in terms of both sampling efficiency and sample quality.
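
For context on why SG-MCMC samples are correlated, here is a textbook stochastic gradient Langevin dynamics (SGLD) step on a toy Gaussian target. This is one of the baseline samplers the paper contrasts with, not the proposed self-learning framework:

```python
# Textbook SGLD on a standard Gaussian: each sample is one small noisy
# gradient step away from the previous one, hence the high autocorrelation.
import numpy as np

rng = np.random.default_rng(0)

def grad_log_p(theta):
    # Target: standard Gaussian, so grad log p(theta) = -theta.
    return -theta

theta, step = np.zeros(2), 1e-2
samples = []
for _ in range(5000):
    theta = theta + 0.5 * step * grad_log_p(theta) \
            + np.sqrt(step) * rng.standard_normal(theta.shape)
    samples.append(theta.copy())
samples = np.array(samples)
print(samples.mean(axis=0), samples.std(axis=0))  # ≈ 0 and ≈ 1
```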


Author(s): Won-Chol Yang

Missing data are a common drawback in many real-world applications of pattern classification. Methods of pattern classification with missing data fall into four types: (a) deletion of incomplete samples and classifier design using only the complete portion of the data, (b) imputation of missing data and learning of the classifier using the edited set, (c) use of model-based procedures, and (d) use of machine learning procedures. These methods can be useful when the amount of missing values is small, but they may be unsuitable when it is relatively large. We propose a method for designing a pattern classification model with block-missing training data. First, we separate complete submatrices from the block-missing training data. Second, we design a classification submodel on each submatrix. Third, we build the final classification model as a linear combination of these submodels. We tested the classification accuracy and data usage rate of models designed by the proposed method in simulation experiments on several datasets, and verified that the method is effective with respect to both measures.
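
A minimal sketch of the three steps, with an invented block-missing pattern and uniform combination weights; the author's actual weighting scheme is not specified here:

```python
# Block-missing classification: fit a submodel per complete submatrix,
# then combine the submodels linearly at prediction time.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
y = (X.sum(axis=1) > 0).astype(int)

# Invented block-missing pattern: rows 0-99 observe features [0, 1],
# rows 100-199 observe features [1, 2]; each complete submatrix gets a submodel.
submodels = []
for rows, cols in [(slice(0, 100), [0, 1]), (slice(100, 200), [1, 2])]:
    model = LogisticRegression().fit(X[rows][:, cols], y[rows])
    submodels.append((cols, model))

def predict_proba(x, weights=(0.5, 0.5)):
    """Final model: a linear combination of the submodels (uniform weights assumed)."""
    return sum(w * m.predict_proba(x[cols].reshape(1, -1))[0]
               for w, (cols, m) in zip(weights, submodels))

print(predict_proba(np.array([0.5, -0.2, 1.0])))
```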


2014
Vol 39 (2)
pp. 107-127
Author(s): Artur Matyja, Krzysztof Siminski

Missing values are not uncommon in real data sets. The algorithms and methods used for analyzing complete data sets cannot always be applied to data with missing values. In order to use the existing methods for complete data, data sets with missing values must be preprocessed. The other solution to this problem is the creation of new algorithms dedicated to data sets with missing values. The objective of our research is to compare preprocessing techniques with such specialised algorithms and to find their most advantageous usage.
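
One concrete instance of this comparison, assuming mean imputation as the preprocessing route and a gradient-boosting learner with native NaN support as the specialised route; the paper's own algorithms may differ:

```python
# Preprocessing (impute + complete-data learner) vs. a NaN-aware learner.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X[np.random.default_rng(0).random(X.shape) < 0.2] = np.nan  # 20% MCAR missingness

# (a) Preprocessing route: impute, then apply a complete-data algorithm.
pre = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler(),
                    LogisticRegression(max_iter=5000))
# (b) Specialised route: a learner that handles NaN natively.
nat = HistGradientBoostingClassifier(random_state=0)

for name, model in [("impute + logistic regression", pre),
                    ("native-NaN gradient boosting", nat)]:
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: accuracy = {score:.3f}")
```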


Author(s): Toru Matsushima, Shinji Nishiwaki, Shintarou Yamasaki, Kazuhiro Izui, Masataka Yoshimura

Minimizing brake squeal is one of the most important issues in the development of high-performance braking systems. Recent advances in numerical analysis, such as finite element analysis, have enabled sophisticated analysis of brake squeal phenomena, but current design methods based on such numerical analyses still fall short of providing concrete performance measures for minimizing brake squeal in high-performance design drafts at the conceptual design phase. This paper proposes an optimal design method for disc brake systems that aims specifically to reduce brake squeal by appropriately modifying the shapes of the brake system components. First, the relationships between the occurrence of brake squeal and the geometry and characteristics of the various components are clarified using a simplified analysis model. Next, a new design performance measure is proposed for evaluating brake squeal performance, and an optimization problem is formulated using this performance measure as the objective function. The optimization problem is solved using a genetic algorithm. Finally, a design example is presented to examine the features of the optimal solutions and to confirm that the proposed method can yield useful design information for the development of high-performance braking systems that minimize brake squeal.
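
A minimal sketch of the genetic algorithm loop described, with a toy stand-in objective; in the paper the squeal measure would come from the simplified brake model, not from a closed-form expression like this:

```python
# Toy genetic algorithm: shape parameters form a chromosome, and the GA
# minimizes a hypothetical squeal measure over a bounded design space.
import numpy as np

rng = np.random.default_rng(0)

def squeal_measure(x):
    # Hypothetical stand-in for the proposed squeal performance measure.
    return np.sum((x - 0.3) ** 2) + 0.1 * np.sin(10 * x).sum()

pop = rng.random((40, 5))                     # 40 designs, 5 shape parameters
for _ in range(100):
    fit = np.apply_along_axis(squeal_measure, 1, pop)
    parents = pop[np.argsort(fit)[:20]]       # truncation selection
    cut = rng.integers(1, 5, size=20)         # one-point crossover
    kids = np.array([np.r_[parents[i, :c], parents[(i + 1) % 20, c:]]
                     for i, c in enumerate(cut)])
    kids += rng.normal(0, 0.02, kids.shape)   # Gaussian mutation
    pop = np.vstack([parents, np.clip(kids, 0, 1)])
best = pop[np.argmin(np.apply_along_axis(squeal_measure, 1, pop))]
print(best)
```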


Foods
2021
Vol 10 (9)
pp. 2057
Author(s): Ching-Wei Cheng, Kun-Ming Lai, Wan-Yu Liu, Cheng-Han Li, Yu-Hsun Chen, ...

Although many ultraviolet-visible-near-infrared transmission spectroscopy techniques have been applied to chicken egg studies, such techniques are not suitable for duck eggs because duck eggshells are much thicker than chicken eggshells. In this study, a high-transmission spectrometer using an equilateral prism as the dispersive element and a flash lamp as the light source was constructed to nondestructively detect the transmission spectrum of duck eggs and monitor the pickling of the eggs. The evolution of egg transmittance was highly correlated with changes in the albumen during pickling. In the first stage, the transmittance decays exponentially with time, with a decay rate related to the pickling rate, while the colors of the albumen and yolk remain almost unchanged. For the second stage, a multiple linear regression analysis model that realizes a one-to-one association between the days of pickling and the transmission spectra was constructed to determine the pickling duration. The coefficient of determination reached 0.88 for a single variable, the transmittance at a wavelength of 590 nm. This method can monitor the maturity of pickled eggs in real time and does not require tracking the evolution of the light transmittance over time.
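
A sketch of the two-stage analysis on synthetic numbers; the decay constants, noise level, and fitted values below are invented, not the paper's data:

```python
# Stage 1: fit an exponential decay of transmittance over pickling time.
# Stage 2: single-variable linear regression of pickling day on the 590 nm signal.
import numpy as np
from scipy.optimize import curve_fit

days = np.arange(0, 20)
# Synthetic transmittance at 590 nm: exponential decay plus noise.
t590 = (0.8 * np.exp(-0.15 * days) + 0.05
        + np.random.default_rng(0).normal(0, 0.01, days.size))

# Stage 1: T(t) = a * exp(-k t) + c; the decay rate k tracks the pickling rate.
(a, k, c), _ = curve_fit(lambda t, a, k, c: a * np.exp(-k * t) + c,
                         days, t590, p0=(1.0, 0.1, 0.0))
print(f"decay rate k ≈ {k:.3f} per day")

# Stage 2: predict days of pickling from the single 590 nm variable.
slope, intercept = np.polyfit(t590, days, 1)
r2 = np.corrcoef(t590, days)[0, 1] ** 2
print(f"R^2 of the single-variable regression ≈ {r2:.2f}")
```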

