Unbiased Precision Estimation under Separate Sampling

2018 ◽  
Author(s):  
Shuilian Xie ◽  
Ulisses M. Braga-Neto

Abstract
Motivation: Precision and recall have become very popular classification accuracy metrics in the statistical learning literature. These metrics are ordinarily defined under the assumption that the data are sampled randomly from the mixture of the populations. However, observational case-control studies for biomarker discovery often collect data that are sampled separately from the case and control populations, particularly in the case of rare diseases. This discrepancy may introduce severe bias in classifier accuracy estimation.
Results: We demonstrate, using both analytical and numerical methods, that classifier precision estimates can display strong bias under separate sampling, with the bias magnitude depending on the difference between the case prevalences in the data and in the actual population. We show that this bias is systematic in the sense that it cannot be reduced by increasing sample size. If information about the true case prevalence is available from public health records, we propose a modified precision estimator that displays smaller bias, which can in fact be reduced to zero as sample size increases under regularity conditions on the classification algorithm. The accuracy of the theoretical analysis and the performance of the proposed precision estimator under separate sampling are investigated using synthetic and real data from observational case-control studies. The results confirmed that the proposed precision estimator indeed becomes unbiased as sample size increases, while the ordinary precision estimator may display large bias, particularly in the case of rare diseases.
Availability: Extra plots are available as Supplementary Materials.
Author summary: Biomedical data are often sampled separately from the case and control populations, particularly in the case of rare diseases. Precision is a popular classification accuracy metric in the statistical learning literature, which implicitly assumes that the data are sampled randomly from the mixture of the populations. In this paper we study the bias of precision under separate sampling using theoretical and numerical methods. We also propose a precision estimator for separate sampling in the case when the prevalence is known from public health records. The results confirmed that the proposed precision estimator becomes unbiased as sample size increases, while the ordinary precision estimator may display large bias, particularly in the case of rare diseases. In the absence of any knowledge about disease prevalence, precision estimates should be avoided under separate sampling.
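The dependence of precision on prevalence described above can be illustrated with a minimal sketch. It is not the authors' estimator, whose exact form the abstract does not give. It only applies the standard Bayes' rule identity that expresses precision in terms of sensitivity, specificity, and a supplied case prevalence. The function name, the counts, and the 1% prevalence are all hypothetical values chosen for illustration.

```python
def precision_from_rates(sensitivity, specificity, prevalence):
    """Precision implied by Bayes' rule for a given case prevalence.

    Standard identity, used here as an illustrative assumption:
    precision = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
    """
    true_pos_mass = sensitivity * prevalence
    false_pos_mass = (1.0 - specificity) * (1.0 - prevalence)
    denominator = true_pos_mass + false_pos_mass
    if denominator == 0:
        raise ValueError("precision is undefined: no predicted positives")
    return true_pos_mass / denominator


# Hypothetical separately sampled test set: 100 cases and 100 controls.
tp, fn = 90, 10   # cases classified as positive / negative
tn, fp = 90, 10   # controls classified as negative / positive

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)

# Ordinary precision uses raw counts, so it silently assumes the
# sample prevalence (here 50%) is the population prevalence.
naive = tp / (tp + fp)

# Plugging in an assumed external prevalence of 1% (hypothetical value).
adjusted = precision_from_rates(sensitivity, specificity, prevalence=0.01)

print(f"ordinary precision estimate:   {naive:.3f}")
print(f"prevalence-adjusted precision: {adjusted:.3f}")
```

In this sketch the sensitivity and specificity estimates do not depend on the case-to-control ratio in the sample, which is why only the prevalence term has to come from an outside source. The gap between the two printed values grows as the assumed prevalence moves away from the sample prevalence.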

2019 ◽  
Vol 18 ◽  
pp. 117693511986082
Author(s):  
Shuilian Xie ◽  
Ulisses M Braga-Neto

Observational case-control studies for biomarker discovery in cancer studies often collect data that are sampled separately from the case and control populations. We present an analysis of the bias in the estimation of the precision of classifiers designed on separately sampled data. The analysis consists of both theoretical and numerical results, which show that classifier precision estimates can display strong bias under separate sampling, with the bias magnitude depending on the difference between the true case prevalence in the population and the sample prevalence in the data. We show that this bias is systematic in the sense that it cannot be reduced by increasing sample size. If information about the true case prevalence is available from public health records, then a modified precision estimator that uses the known prevalence displays smaller bias, which can in fact be reduced to zero as sample size increases under regularity conditions on the classification algorithm. The accuracy of the theoretical analysis and the performance of the precision estimators under separate sampling are confirmed by numerical experiments using synthetic and real data from published observational case-control studies. The results with real data confirmed that, with separately sampled data, the usual estimator produces larger, i.e., more optimistic, precision estimates than the estimator using the true prevalence value.


2017 ◽  
Vol 28 (3) ◽  
pp. 822-834
Author(s):  
Mitchell H Gail ◽  
Sebastien Haneuse

Sample size calculations are needed to design and assess the feasibility of case-control studies. Although such calculations are readily available for simple case-control designs and univariate analyses, there is limited theory and software for multivariate unconditional logistic analysis of case-control data. Here we outline the theory needed to detect scalar exposure effects or scalar interactions while controlling for other covariates in logistic regression. Both analytical and simulation methods are presented, together with links to the corresponding software.


Author(s):  
Koustuv Saha ◽  
Amit Sharma

Online mental health communities enable people to seek and provide support, and growing evidence shows the efficacy of community participation in coping with mental health distress. However, which factors of peer support lead to favorable psychosocial outcomes for individuals is less clear. Using a dataset of over 300K posts by ∼39K individuals on the online community TalkLife, we present a study to investigate the effect of several factors, such as adaptability, diversity, immediacy, and the nature of support. Unlike typical causal studies that focus on the effect of each treatment, we focus on the outcome and address the reverse causal question of identifying treatments that may have led to the outcome, drawing on case-control studies in epidemiology. Specifically, we define the outcome as an aggregate of affective, behavioral, and cognitive psychosocial change and identify Case (most improved) and Control (least improved) cohorts of individuals. Considering responses from peers as treatments, we evaluate the differences in the responses received by Case and Control, per matched clusters of similar individuals. We find that effective support includes complex language factors such as diversity, adaptability, and style, but simple indicators such as quantity and immediacy are not causally relevant. Our work bears methodological and design implications for online mental health platforms, and has the potential to guide suggestive interventions for peer supporters on these platforms.


2005 ◽  
Vol 26 (4) ◽  
pp. 342-345 ◽  
Author(s):  
Anthony D. Harris ◽  
Yehuda Carmeli ◽  
Matthew H. Samore ◽  
Keith S. Kaye ◽  
Eli Perencevich

Abstract
Background: Case-control studies often analyze risk factors for antibiotic resistance. Recently published articles have illustrated that randomly selected control-patients may be preferable to those with the susceptible phenotype of the organism. A possible methodologic problem with randomly selected control-patients is potential bias due to control group misclassification. This occurs if some control-patients did not have clinical cultures performed and thus might have been unidentified case-patients. If this bias exists, these studies might be expected to report lower odds ratios (ORs) because control-patients would be more like case-patients.
Objective: To analyze potential biases that might arise due to control group misclassification and potentially larger selection biases that may be introduced if control-patients are required to have at least one clinical culture.
Patients: One hundred twenty case-patients, 770 control-patients in group 1, and 510 control-patients in group 2.
Methods: Two case-control studies. Case-patients had clinical cultures positive for imipenem-resistant Pseudomonas aeruginosa. The first group of control-patients were random. The second group of control-patients were identical to those in group 1 except being required to have at least one clinical culture.
Results: Univariate analyses showed higher ORs for case-patients versus control-patients in group 1 (imipenem [OR, 12.5], piperacillin-tazobactam [OR, 3.7], and vancomycin [OR, 4.7]) as compared with case-patients versus control-patients in group 2 (imipenem [OR, 8.0], piperacillin-tazobactam [OR, 2.5], and vancomycin [OR, 3.0]).
Conclusion: Requiring control-patients to have at least one clinical culture introduces a selection bias likely because it eliminates patients with less severe illness.


2016 ◽  
Vol 8 (1) ◽  
pp. 36-44
Author(s):  
Neslihan DEMİREL ◽  
Özlem EGE ORUÇ ◽  
Selma GÜRLER

Biometrics ◽  
1986 ◽  
Vol 42 (4) ◽  
pp. 927 ◽  
Author(s):  
Robert F. Woolson ◽  
Judy A. Bean ◽  
Patricio B. Rojas
