Using Logistic Regression to Estimate the False Positive Rate in the IDI (SoLinks)

Author(s): Anna Lin, Soon Song, Nancy Wang

Introduction: Stats NZ’s Integrated Data Infrastructure (IDI) is a linked longitudinal database combining administrative and survey data. Previously, false positive links (FP) in the IDI were assessed by clerical review of a sample of linked records, which was time-consuming and subject to inconsistency. Objectives and Approach: A modelled approach, ‘SoLinks’, has been developed to automate FP estimation for the IDI. It uses a logistic regression model to calculate the probability that a given link is a true match. The model is based on the agreement types defined for four key linking variables: first name, last name, sex, and date of birth. Exemptions are given to some specific types of links that we believe to be high-quality true matches. The training data used to estimate the model parameters was based on the outcomes of the clerical review process over several years. Results: We compared the FP rates estimated through clerical review with those estimated by the SoLinks model. Some SoLinks estimates fall outside the 95% confidence intervals of the clerically reviewed ones. This may be because the pre-defined probabilities assigned to the exempted link types are too high. Conclusion: The automation of FP checking has saved analyst time and resources. The modelled FP estimates have been more stable over time than the previous clerical reviews. Because this model estimates the probability of a true match at the individual link level, this probability could be provided to researchers so that they can calculate linkage quality indicators for their own research populations.
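
To make the modelled approach concrete, here is a minimal sketch of a logistic regression over agreement flags for the four linking variables, with hypothetical training data standing in for the clerical-review outcomes (the feature encoding and values are illustrative assumptions, not Stats NZ's actual model):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical clerical-review outcomes: one row per reviewed link, with
# 1/0 agreement flags for first name, last name, sex, and date of birth.
X = np.array([
    [1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 0], [1, 0, 1, 1],
    [0, 1, 1, 1], [1, 1, 0, 1], [0, 0, 1, 0], [0, 1, 0, 0],
    [1, 0, 0, 0], [0, 0, 0, 0],
])
y = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])  # 1 = confirmed true match

model = LogisticRegression().fit(X, y)

# Probability that a link agreeing on everything except date of birth
# is a true match.
p_true = model.predict_proba([[1, 1, 1, 0]])[0, 1]

# A file-level FP rate estimate is then the mean of (1 - p_true) over
# all links, with exempted link types pinned to a fixed probability.
print(f"P(true match | names and sex agree, DOB differs) = {p_true:.3f}")
```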

1993, Vol 32 (02), pp. 175-179
Author(s): B. Brambati, T. Chard, J. G. Grudzinskas, M. C. M. Macintosh

Abstract: The analysis of the clinical efficiency of a biochemical parameter in the prediction of chromosome anomalies is described, using a database of 475 cases including 30 abnormalities. Two approaches to the statistical analysis were compared: Gaussian frequency distributions with likelihood ratios, and logistic regression. Both methods found that, at a 5% false-positive rate, approximately 60% of anomalies are detected on the basis of maternal age and serum PAPP-A. Logistic regression is appropriate where the outcome variable (chromosome anomaly) is binary, but its detection rates refer to the original data only. The likelihood ratio method is used to predict the outcome in the general population; it depends on the data, or some transformation of the data, fitting a known frequency distribution (Gaussian in this case). The precision of the predicted detection rates is limited by the small sample of abnormal cases (30). Varying the means and standard deviations of the fitted log-Gaussian distributions to the limits of their 95% confidence intervals resulted in detection rates between 42% and 79% for a 5% false-positive rate. Thus, although the likelihood ratio method is potentially the better method for determining the usefulness of a test in the general population, larger numbers of abnormal cases are required to stabilise the means and standard deviations of the fitted log-Gaussian distributions.
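
For illustration, a brief sketch of the Gaussian likelihood-ratio calculation of a detection rate at a fixed 5% false-positive rate, using hypothetical log-Gaussian parameters (the paper's fitted means and standard deviations are not reproduced here):

```python
from scipy.stats import norm

# Hypothetical fitted log-Gaussian parameters (illustrative only):
# mean of unaffected, mean of affected, common standard deviation.
mu_unaff, mu_aff, sd = 0.0, -0.55, 0.65   # e.g. log10 PAPP-A MoM

# Detection rate at a fixed 5% false-positive rate: place the cutoff at
# the 5th percentile of the unaffected distribution (low PAPP-A flags
# risk), then ask what fraction of affected cases fall below it.
cutoff = norm.ppf(0.05, mu_unaff, sd)
detection_rate = norm.cdf(cutoff, mu_aff, sd)

# Equivalent likelihood-ratio view: LR at a given marker level x.
x = -0.8
lr = norm.pdf(x, mu_aff, sd) / norm.pdf(x, mu_unaff, sd)

print(f"DR at 5% FPR = {detection_rate:.1%}, LR({x}) = {lr:.1f}")
```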


2018, Vol 164, pp. 01047
Author(s): Viny Christanti Mawardi, Niko Susanto, Dali Santun Naga

Any mistake in the writing of a document causes its information to be conveyed incorrectly. These days, most documents are written on a computer, so spelling correction is needed to fix writing mistakes. This design process concerns the construction of a spelling corrector for Indonesian-language document text, taking a document's text as input and producing a .txt file as output. For the implementation, 5,000 news articles were used as training data. The methods used include Finite State Automata (FSA), Levenshtein distance, and N-grams. The results are evaluated by perplexity, correction hit rate, and false positive rate. The unigram model has the smallest perplexity, at 1.14. The bigram and trigram models share the highest correction hit rate, at 71.20%, but the bigram is superior in average processing time, at 01:21.23 min. The unigram, bigram, and trigram all have the same false positive rate of 4.15%. Owing to the shortcomings of the FSA method, a modification was made that raises the bigram correction hit rate to 85.44%.
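
As a rough illustration of the pipeline described, the sketch below combines Levenshtein distance for candidate generation with a bigram model for ranking; the toy corpus and helper names are stand-ins, not the authors' implementation:

```python
from collections import Counter

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# Toy Indonesian corpus standing in for the 5,000 news articles.
corpus = "saya makan nasi goreng saya suka nasi goreng".split()
vocab = set(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def correct(prev_word: str, word: str) -> str:
    """Return the best correction for `word` given the previous word."""
    if word in vocab:
        return word
    candidates = [w for w in vocab if levenshtein(word, w) <= 2]
    # Rank by smoothed bigram probability given the previous word.
    return max(candidates,
               key=lambda w: (bigrams[(prev_word, w)] + 1)
                             / (unigrams[prev_word] + len(vocab)),
               default=word)

print(correct("makan", "nassi"))  # -> "nasi"
```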


2005, Vol 12 (4), pp. 197-201
Author(s): Nicholas J Wald, Joan K Morris, Simon Rish

Objective: To determine the quantitative effect on overall screening performance (detection rate for a given false-positive rate) of using several moderately strong, independent risk factors in combination as screening markers. Setting: Theoretical statistical analysis. Methods: For the purposes of this analysis, it was assumed that all risk factors were independent, had Gaussian distributions with the same standard deviation in affected and unaffected individuals and had the same screening performance. We determined the overall screening performance associated with using an increasing number of risk factors together, with each risk factor having a detection rate of 10%, 15% or 20% for a 5% false-positive rate. The overall screening performance was estimated as the detection rate for a 5% false-positive rate. Results: Combining the risk factors increased the screening performance, but the gain in detection at a constant false-positive rate was relatively modest and diminished with the addition of each risk factor. Combining three risk factors, each with a 15% detection rate for a 5% false-positive rate, yields a 28% detection rate. Combining five risk factors increases the detection rate to 39%. If the individual risk factors have a detection rate of 10% for a 5% false-positive rate, it would require combining about 15 such risk factors to achieve a comparable overall detection rate (41%). Conclusion: It is intuitively thought that combining moderately strong risk factors can substantially improve screening performance. For example, most cardiovascular risk factors that may be used in screening for ischaemic heart disease events, such as serum cholesterol and blood pressure, have a relatively modest screening performance (about 15% detection rate for a 5% false-positive rate). It would require the combination of about 15 or 20 such risk factors to achieve detection rates of about 80% for a 5% false-positive rate. This is impractical, given the risk factors so far discovered, because there are too few risk factors and their associations with disease are too weak.
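
Under the stated assumptions (independent Gaussian markers with equal standard deviations in affected and unaffected groups), the quoted detection rates can be reproduced directly; a short sketch:

```python
from scipy.stats import norm

def detection_rate(n: int, dr1: float, fpr: float = 0.05) -> float:
    """Detection rate of n combined independent Gaussian markers, each
    with detection rate dr1 at the given false-positive rate and equal
    SDs in affected and unaffected groups."""
    z_fpr = norm.ppf(1 - fpr)            # cutoff in SD units
    d = z_fpr - norm.ppf(1 - dr1)        # per-marker mean separation
    # Summing n independent markers scales the separation by sqrt(n).
    return norm.cdf(n ** 0.5 * d - z_fpr)

print(f"{detection_rate(3, 0.15):.0%}")   # 3 markers at 15% DR -> ~28%
print(f"{detection_rate(5, 0.15):.0%}")   # 5 markers at 15% DR -> ~39%
print(f"{detection_rate(15, 0.10):.0%}")  # 15 markers at 10% DR -> ~41%
```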


2021, Vol 59 (3), pp. 865-918
Author(s): Ran Abramitzky, Leah Boustan, Katherine Eriksson, James Feigenbaum, Santiago Pérez

The recent digitization of complete count census data is an extraordinary opportunity for social scientists to create large longitudinal datasets by linking individuals from one census to another or from other sources to the census. We evaluate different automated methods for record linkage, performing a series of comparisons across methods and against hand linking. We have three main findings that lead us to conclude that automated methods perform well. First, a number of automated methods generate very low (less than 5 percent) false positive rates. The automated methods trace out a frontier illustrating the trade-off between the false positive rate and the (true) match rate. Relative to more conservative automated algorithms, humans tend to link more observations but at a cost of higher rates of false positives. Second, when human linkers and algorithms use the same linking variables, there is relatively little disagreement between them. Third, across a number of plausible analyses, coefficient estimates and parameters of interest are very similar when using linked samples based on each of the different automated methods. We provide code and Stata commands to implement the various automated methods. (JEL C81, C83, N01, N31, N32)
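
The frontier described above can be illustrated schematically: raising a linking algorithm's acceptance threshold trades match rate against false positive rate. The similarity score and record pairs below are hypothetical stand-ins, not the paper's methods:

```python
import difflib

# Hypothetical candidate pairs: (name in census A, name in census B,
# whether the pair is truly the same person per hand-linked truth).
pairs = [
    ("john smith", "john smith",    True),
    ("john smith", "jon smith",     True),
    ("mary jones", "mary james",    False),
    ("wm. brown",  "william brown", True),
    ("anna lee",   "anna leigh",    False),
]

def score(a: str, b: str) -> float:
    # Simple string similarity as a stand-in for a real linking score.
    return difflib.SequenceMatcher(None, a, b).ratio()

# Sweep the acceptance threshold to trace the match-rate/FPR frontier.
for t in (0.6, 0.8, 0.95):
    accepted = [(a, b, truth) for a, b, truth in pairs if score(a, b) >= t]
    match_rate = len(accepted) / len(pairs)
    fpr = sum(not truth for *_, truth in accepted) / max(len(accepted), 1)
    print(f"threshold {t:.2f}: match rate {match_rate:.0%}, FPR {fpr:.0%}")
```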


2020, Vol 14
Author(s): Giles Tetteh, Velizar Efremov, Nils D. Forkert, Matthias Schneider, Jan Kirschke, ...

We present DeepVesselNet, an architecture tailored to the challenges of extracting vessel trees and networks, and corresponding features, from 3-D angiographic volumes using deep learning. We discuss the problems of low execution speed and high memory requirements associated with full 3-D networks, the high class imbalance arising from the low percentage (<3%) of vessel voxels, and the unavailability of accurately annotated 3-D training data, and we offer solutions as the building blocks of DeepVesselNet. First, we formulate 2-D orthogonal cross-hair filters which make use of 3-D context information at a reduced computational burden. Second, we introduce a class-balancing cross-entropy loss function with false-positive rate correction to handle the high class imbalance and high false positive rates associated with existing loss functions. Finally, we generate a synthetic dataset using a computational angiogenesis model capable of simulating vascular tree growth under physiological constraints on local network structure and topology, and use these data for transfer learning. We demonstrate the performance on a range of angiographic volumes at different spatial scales, including clinical MRA data of the human brain as well as CTA and microscopy scans of the rat brain. Our results show that cross-hair filters achieve over 23% improvement in speed, a lower memory footprint, and lower network complexity (which helps prevent overfitting), with accuracy that does not differ from full 3-D filters. Our class-balancing metric is crucial for training the network, and transfer learning with synthetic data is an efficient, robust, and highly generalizable approach, yielding a network that excels in a variety of angiography segmentation tasks. We observe that sub-sampling and max-pooling layers may lead to a drop in performance in tasks that involve voxel-sized structures. For this reason, the DeepVesselNet architecture does not use any form of sub-sampling layer, and it works well for vessel segmentation, centerline prediction, and bifurcation detection. We make our synthetic training data publicly available, fostering future research and serving as one of the first public datasets for brain vessel tree segmentation and analysis.
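
As a sketch of the loss-function idea (not the paper's exact formulation), the following weights each class by the prevalence of the other and adds a penalty on predicted-positive mass over background voxels:

```python
import numpy as np

def class_balanced_bce(p: np.ndarray, y: np.ndarray,
                       gamma: float = 0.5) -> float:
    """Cross-entropy where each class is weighted by the prevalence of
    the other (vessel voxels are <3% of the volume), plus a term
    penalising predicted-positive mass on background voxels."""
    eps = 1e-7
    p = np.clip(p, eps, 1 - eps)
    w_pos = 1.0 - y.mean()          # rare positives get a large weight
    w_neg = y.mean()                # abundant negatives get a small one
    bce = -(w_pos * y * np.log(p)
            + w_neg * (1 - y) * np.log(1 - p)).mean()
    # False-positive correction: mean probability assigned to background.
    fp_term = (p * (1 - y)).mean()
    return bce + gamma * fp_term

# Toy volume: 1,000 voxels, roughly 2% vessel.
rng = np.random.default_rng(0)
y = (rng.random(1000) < 0.02).astype(float)
p = np.clip(rng.random(1000) * 0.3 + 0.6 * y, 0.0, 1.0)
print(class_balanced_bce(p, y))
```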


2002, Vol 41 (01), pp. 37-41
Author(s): S. Shung-Shung, S. Yu-Chien, Y. Mei-Due, W. Hwei-Chung, A. Kao

Summary Aim: Even with careful observation, the overall false-positive rate of laparotomy remains 10-15% when acute appendicitis is suspected. We therefore assessed the clinical efficacy of the Tc-99m HMPAO-labelled leukocyte (TC-WBC) scan for diagnosing acute appendicitis in patients presenting with atypical clinical findings. Patients and Methods: Eighty patients presenting with acute abdominal pain and possible acute appendicitis but atypical findings were included in this study. After intravenous injection of TC-WBC, serial anterior abdominal/pelvic images at 30, 60, 120 and 240 min with 800k counts were obtained with a gamma camera. Any abnormal localization of radioactivity in the right lower quadrant of the abdomen, equal to or greater than bone marrow activity, was considered a positive scan. Results: 36 of the 49 patients with positive TC-WBC scans received appendectomy, and all proved to have positive pathological findings. Five positive TC-WBC scans were unrelated to acute appendicitis, being due to other pathological lesions. Eight patients were not operated on, and clinical follow-up after one month revealed no acute abdominal condition. Three of the 31 patients with negative TC-WBC scans received appendectomy; they also had positive pathological findings. The remaining 28 patients were not operated on and showed no evidence of appendicitis after at least one month of follow-up. The overall sensitivity, specificity, accuracy, and positive and negative predictive values of the TC-WBC scan for diagnosing acute appendicitis were 92, 78, 86, 82, and 90%, respectively. Conclusion: The TC-WBC scan provides a rapid and highly accurate method for the diagnosis of acute appendicitis in patients with equivocal clinical findings. It proved useful in reducing the false-positive rate of laparotomy and shortening the time necessary for clinical observation.
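
These figures follow from the standard 2×2 definitions; a short sketch using the counts reported above (how the five scans positive for other lesions are assigned is ambiguous, so some rounded values may differ slightly from the abstract's):

```python
# Counts from the abstract: 36 true positives (operated, confirmed),
# 3 false negatives (negative scan, confirmed appendicitis), 28 true
# negatives, and 8 positive scans with no appendicitis on follow-up.
# The 5 scans positive for other lesions are excluded here, which is
# one plausible reading of the reported figures.
tp, fn, tn, fp = 36, 3, 28, 8

sensitivity = tp / (tp + fn)                    # 36/39 ~ 92%
specificity = tn / (tn + fp)                    # 28/36 ~ 78%
accuracy    = (tp + tn) / (tp + tn + fp + fn)   # 64/75 ~ 85%
ppv         = tp / (tp + fp)                    # 36/44 ~ 82%
npv         = tn / (tn + fn)                    # 28/31 ~ 90%

print(f"Sens {sensitivity:.0%}, Spec {specificity:.0%}, "
      f"Acc {accuracy:.0%}, PPV {ppv:.0%}, NPV {npv:.0%}")
```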


2019
Author(s): Amanda Kvarven, Eirik Strømland, Magnus Johannesson

Andrews & Kasy (2019) propose an approach for adjusting effect sizes in meta-analysis for publication bias. We use the Andrews-Kasy estimator to adjust the result of 15 meta-analyses and compare the adjusted results to 15 large-scale multiple labs replication studies estimating the same effects. The pre-registered replications provide precisely estimated effect sizes, which do not suffer from publication bias. The Andrews-Kasy approach leads to a moderate reduction of the inflated effect sizes in the meta-analyses. However, the approach still overestimates effect sizes by a factor of about two or more and has an estimated false positive rate of between 57% and 100%.
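
One plausible operationalisation of the false-positive-rate comparison: a meta-analysis counts as a false positive when its adjusted estimate is statistically significant but the corresponding replication finds no effect. The numbers below are illustrative, not the study's data:

```python
from scipy.stats import norm

def is_significant(est: float, se: float, alpha: float = 0.05) -> bool:
    """Two-sided z-test on an effect estimate."""
    return abs(est / se) > norm.ppf(1 - alpha / 2)

# Hypothetical (estimate, SE) pairs: adjusted meta-analytic results and
# the corresponding pre-registered replications. Illustrative only.
meta = [(0.40, 0.10), (0.25, 0.08), (0.30, 0.12), (0.05, 0.06)]
repl = [(0.02, 0.03), (0.20, 0.04), (0.01, 0.05), (0.00, 0.04)]

# Among effects whose replication is null, count the share of adjusted
# meta-analytic estimates that are nonetheless significant.
null_effects = [i for i, r in enumerate(repl) if not is_significant(*r)]
fp = sum(is_significant(*meta[i]) for i in null_effects)
print(f"False positive rate: {fp}/{len(null_effects)} "
      f"= {fp / len(null_effects):.0%}")
```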

