Ranking procedures for matched pairs with missing data — Asymptotic theory and a small sample approximation

2012 ◽  
Vol 56 (5) ◽  
pp. 1090-1102 ◽  
Author(s):  
F. Konietschke ◽  
S.W. Harrar ◽  
K. Lange ◽  
E. Brunner
Author(s):  
Marcos Barreto ◽  
André Alves ◽  
Samila Sena ◽  
Rosemeire Fiaccone ◽  
Leila Amorim ◽  
...  

ABSTRACT Background and aims: The Brazilian government has several social protection programmes that select their beneficiaries based on socioeconomic information kept in the Cadastro Único (CADU) database. The CADU will be used to build a population-based cohort of approximately 100 million individuals. Among the social programmes is the Bolsa Família (PBF), a conditional cash transfer programme that provides extra income to poor families. These two databases must be deterministically linked to identify individuals who received payments from PBF between 2004 and 2012. The cohort will be used in epidemiological studies aiming to assess the impact of PBF on the occurrence and severity of several diseases and health problems (tuberculosis, leprosy, HIV, child health, etc.). This cohort must be probabilistically linked with databases from the Unified Health System (SUS), such as hospitalization, notifiable diseases, mortality, and live births, in order to produce data marts (domain-specific data) for the proposed studies. Our goals comprise the validation of probabilistic record linkage methods to support this cohort setup. Approach: This paper emphasizes the accuracy assessment of our methods based on the linkage of SIH (hospitalization), SINAN (notifications), and SIM (mortality) records to the 2011 extraction of CADU. We focused on hospitalization and notification of tuberculosis, as well as infant mortality from all causes in children under 4, for a small sample of 30,029 records (CADU). Due to the absence of gold standards, we used two approaches to assess accuracy: a clerical review and an automatic (tool-based) search. In the first case, we used different cut-off points on the similarity index to calculate sensitivity and specificity, and a ROC curve to separate matched and non-matched pairs. The second approach retrieves from CADU all matched and non-matched pairs for a given individual, serving as a gold standard for validation.
Results: We retrieved 22 linked pairs, of which 18 are true positives for infant mortality (SIM database). From SINAN, our results were 434 linked pairs with 166 true positives, and with SIH, 121 linked pairs with 34 true positives. The sensitivity of manual scan for SIM (child mortality) ranges from 44% (specificity of 100%) to 95% (specificity of 94%), with similarity indices between 0.80 and 0.97, respectively. For automatic search, we obtained a sensitivity of 69.2% and specificity of 91.8%. Conclusion: Our results show the need for continuous improvement of our linkage routines and for consistent evaluation of their accuracy in the absence of adequate gold standards.
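
The accuracy measures named in this abstract can be illustrated with a minimal Python sketch. The linked-pair counts are the ones reported above; the share of linked pairs that are true positives (positive predictive value) is derived here and is not a figure the abstract reports. The `scores` and `is_match` lists are hypothetical toy data, and `sensitivity_specificity` is an illustrative helper, not the authors' tool.

```python
# (linked pairs, true positives) as reported in the abstract.
reported = {
    "SIM": (22, 18),
    "SINAN": (434, 166),
    "SIH": (121, 34),
}


def positive_predictive_value(linked, true_positives):
    """Share of linked pairs that are true matches."""
    return true_positives / linked


ppv = {db: positive_predictive_value(*counts) for db, counts in reported.items()}


def sensitivity_specificity(scores, is_match, cutoff):
    """Treat pairs with score >= cutoff as links and compare with review labels."""
    pairs = list(zip(scores, is_match))
    tp = sum(1 for s, m in pairs if s >= cutoff and m)
    fn = sum(1 for s, m in pairs if s < cutoff and m)
    tn = sum(1 for s, m in pairs if s < cutoff and not m)
    fp = sum(1 for s, m in pairs if s >= cutoff and not m)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical similarity scores and clerical-review labels (not study data).
scores = [0.99, 0.95, 0.90, 0.85, 0.82, 0.70]
is_match = [True, True, True, False, True, False]

for db, value in ppv.items():
    print(f"{db}: PPV = {value:.1%}")
for cutoff in (0.80, 0.97):
    sens, spec = sensitivity_specificity(scores, is_match, cutoff)
    print(f"cut-off {cutoff}: sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

On the toy data, raising the cut-off trades sensitivity for specificity, which is the trade-off a ROC curve summarizes across all cut-offs.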


2021 ◽  
Vol 30 (10) ◽  
pp. 2221-2238
Author(s):  
Sarah B Peskoe ◽  
David Arterburn ◽  
Karen J Coleman ◽  
Lisa J Herrinton ◽  
Michael J Daniels ◽  
...  

While electronic health records data provide unique opportunities for research, numerous methodological issues must be considered. Among these, selection bias due to incomplete/missing data has received far less attention than other issues. Unfortunately, standard missing data approaches (e.g. inverse-probability weighting and multiple imputation) generally fail to acknowledge the complex interplay of heterogeneous decisions made by patients, providers, and health systems that govern whether specific data elements in the electronic health records are observed. This, in turn, renders the missing-at-random assumption difficult to believe in standard approaches. In the clinical literature, the collection of decisions that gives rise to the observed data is referred to as the data provenance. Building on a recently-proposed framework for modularizing the data provenance, we develop a general and scalable framework for estimation and inference with respect to regression models based on inverse-probability weighting that allows for a hierarchy of missingness mechanisms to better align with the complex nature of electronic health records data. We show that the proposed estimator is consistent and asymptotically Normal, derive the form of the asymptotic variance, and propose two consistent estimators. Simulations show that naïve application of standard methods may yield biased point estimates, that the proposed estimators have good small-sample properties, and that researchers may have to contend with a bias-variance trade-off as they consider how to handle missing data. The proposed methods are motivated by an on-going, electronic health records-based study of bariatric surgery.
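
As a rough illustration of inverse-probability weighting with a hierarchy of missingness mechanisms, the following Python sketch weights each observed record by the inverse of the product of its module-level observation probabilities. It is a generic weighted mean under stated assumptions, not the estimator proposed in the paper: the two modules (say, "has a visit" and "measurement recorded"), their probabilities, and the outcome values are all hypothetical, and the probabilities are treated as known rather than estimated.

```python
def ipw_mean(records):
    """Normalized IPW mean; each record is (outcome, [module probabilities])."""
    numerator = 0.0
    denominator = 0.0
    for outcome, module_probs in records:
        p_observed = 1.0
        for p in module_probs:
            # Probability of being observed is the product over the modules.
            p_observed *= p
        weight = 1.0 / p_observed
        numerator += weight * outcome
        denominator += weight
    return numerator / denominator


# Hypothetical population: two strata of 100 people, outcomes 1.0 and 3.0,
# so the full-data mean is 2.0. Stratum A is observed with probability
# 0.5 * 0.8 = 0.4 (40 records), stratum B with 0.5 * 0.4 = 0.2 (20 records).
observed = [(1.0, [0.5, 0.8])] * 40 + [(3.0, [0.5, 0.4])] * 20

naive = sum(y for y, _ in observed) / len(observed)
weighted = ipw_mean(observed)
print(f"naive mean = {naive:.3f}, IPW mean = {weighted:.3f}")
```

In this constructed example the complete-case mean is pulled toward the stratum that is more likely to be observed, while the weighted mean recovers the full-data value of 2.0.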


Blood ◽  
2008 ◽  
Vol 112 (11) ◽  
pp. 563-563 ◽  
Author(s):  
Ann E Woolfrey ◽  
John Klein ◽  
Michael D Haagenson ◽  
Stephen R Spellman ◽  
Minoo Battiwalla ◽  
...  

Abstract Criteria for the selection of HLA mismatched donors are needed when an HLA matched unrelated donor is not available. To define the risks associated with mismatching at HLA loci, and the impact of number of HLA mismatches on outcome, we studied 1933 patients receiving URD peripheral blood stem cell (PBSC) transplants facilitated by the National Marrow Donor Program between 1999 and 2006 for treatment of AML, ALL, CML or MDS. Myeloablative (65%) and reduced intensity (35%) regimens were included. The transplanted PBSC grafts were T cell-replete, and nearly all patients (99%) received calcineurin inhibitor-based GVHD prophylaxis. Median follow-up was 2 years. Pairs were typed for HLA-A, B, C, DRB1, DQA1 and DQB1 by high resolution typing methods. Matching was classified as low resolution (antigen-equivalent) or high resolution (allele) involving HLA-A, B, C, and DRB1 (8/8 match). Because of multiple comparisons, p-values <0.01 were considered significant. All analyses were adjusted for patient and transplant characteristics. Results: No effect of HLA-DQ mismatching was found for 8/8 or 7/8 matched transplant pairs; hence DQ mismatch was removed from subsequent models. Matching for 8/8 alleles was associated with better survival at one year (56% vs. 47%, p=0.001) compared with 7/8 matched pairs. Using patients with 8/8 match for comparison (n=1243), a single HLA-antigen mismatch (n=293) was associated with a significantly higher risk for overall mortality (OM) (relative risk (RR)=1.32, 95% confidence interval [CI] 1.12–1.55, p=0.0007), transplant-related mortality (TRM) (RR 1.54 [1.24–1.91] p=0.0001), grades III-IV graft-vs.-host disease (GVHD) (RR 1.93 [1.53–2.44] p<0.0001), and lower disease-free survival (DFS) (RR 1.29 [1.10–1.51] p=0.0013).
No statistically significant decrement in survival was seen for those with a single (n=208) or double (n=28) HLA-allele mismatch involving HLA-A, B, C, and/or DRB1, although small sample size limits the power of the analysis. Two antigen or antigen plus allele mismatches [6/8 pairs] were associated with 2 to 3 times the risk for OM and TRM compared with 8/8 matched pairs, all p<0.001. Comparing 8/8 to 7/8 donor-recipient pairs mismatched at specific loci, only HLA-C antigen mismatches (n=187) were significantly associated with lower DFS (RR=1.36 [1.13–1.64] p=0.0010), and increased risk for OM (RR=1.41 [1.16–1.70], p=0.0005), TRM (RR=1.61 [1.25–2.08], p=0.0002), and GVHD grades III-IV (RR=1.98 [1.50–2.62], p<0.0001). No differences in outcome were observed for HLA-C allele mismatch (n=61), nor for mismatches at HLA-A antigen/allele (n=136), -B antigen/allele (n=73), -DRB1 allele (n=39) or -DQ antigen/allele (n=114) compared to 8/8 matching. HLA mismatching was not associated with relapse or chronic GVHD. Conclusion: These data suggest that when 8/8 matched PBSC donors are not available, HLA-C antigen mismatched donors should be avoided. The effects of HLA-mismatching in URD PBSC may be distinct from marrow transplants, although additional studies with larger numbers of patients may increase the power to detect effects of other specific locus mismatches.


2016 ◽  
Vol 38 (4) ◽  
pp. 195-206
Author(s):  
Dan Farley ◽  
Daniel Anderson ◽  
P. Shawn Irvin ◽  
Gerald Tindal

Modeling growth for students with significant cognitive disabilities (SWSCD) is difficult due to a variety of factors, including, but not limited to, missing data, test scaling, group heterogeneity, and small sample sizes. These challenges may account for the paucity of previous research exploring the academic growth of SWSCD. Our study represents a unique context in which a reading assessment, calibrated to a common scale, was administered statewide to students in consecutive years across Grades 3 to 5. We used a nonlinear latent growth curve pattern-mixture model to estimate students’ achievement and growth while accounting for patterns of missing data. While we observed significant intercept differences across disability subgroups, there were no significant slope differences. Incorporating missing data patterns into our models improved model fit. Limitations and directions for future research are discussed.


2014 ◽  
Vol 38 (5) ◽  
pp. 435-452 ◽  
Author(s):  
Fan Jia ◽  
E. Whitney G. Moore ◽  
Richard Kinai ◽  
Kelly S. Crowe ◽  
Alexander M. Schoemann ◽  
...  

Utilizing planned missing data (PMD) designs (e.g., 3-form surveys) enables researchers to ask participants fewer questions during the data collection process. An important question, however, is just how few participants are needed to effectively employ planned missing data designs in research studies. This article explores this question by using simulated three-form planned missing data to assess analytic model convergence, parameter estimate bias, standard error bias, mean squared error (MSE), and relative efficiency (RE). Three models were examined: a one-time-point, cross-sectional model with 3 constructs; a two-time-point model with 3 constructs at each time point; and a three-time-point, mediation model with 3 constructs over three time points. Both full-information maximum likelihood (FIML) and multiple imputation (MI) were used to handle the missing data. Models were found to meet convergence rate and acceptable bias criteria with FIML at smaller sample sizes than with MI.
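
For readers unfamiliar with the three-form design, the Python sketch below shows how it produces planned missingness. It assumes the commonly described layout in which items are split into a common block X plus blocks A, B, and C, and each form omits one of A, B, or C. The block names, the round-robin form assignment, and the `apply_three_form` helper are illustrative assumptions; the abstract does not describe how the simulated data were generated.

```python
# Blocks administered on each form; every form omits exactly one of A, B, C.
FORMS = {
    1: {"X", "A", "B"},  # omits C
    2: {"X", "A", "C"},  # omits B
    3: {"X", "B", "C"},  # omits A
}


def apply_three_form(responses):
    """Return copies of complete responses with the omitted block set to None.

    Forms are assigned round-robin here for reproducibility; a real study
    would typically assign them at random.
    """
    masked = []
    for index, row in enumerate(responses):
        form = index % 3 + 1
        masked.append(
            {block: (value if block in FORMS[form] else None)
             for block, value in row.items()}
        )
    return masked


# Hypothetical complete data: one score per block for six participants.
complete = [{"X": 1, "A": 2, "B": 3, "C": 4} for _ in range(6)]
planned = apply_three_form(complete)

for block in ("X", "A", "B", "C"):
    missing = sum(row[block] is None for row in planned)
    print(f"block {block}: {missing} of {len(planned)} missing")
```

Under this layout the common block is always observed and each of the other blocks is missing for one third of participants, which is the missingness that FIML or MI would then have to handle.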

