Assessing the accuracy of machine-assisted abstract screening with DistillerAI: a user study

2019 ◽  
Vol 8 (1) ◽  
Author(s):  
Gerald Gartlehner ◽  
Gernot Wagner ◽  
Linda Lux ◽  
Lisa Affengruber ◽  
Andreea Dobrescu ◽  
...  

Abstract
Background: Web applications that employ natural language processing technologies to support systematic reviewers during abstract screening have become more common. The goal of our project was to conduct a case study to explore a screening approach that temporarily replaces a human screener with a semi-automated screening tool.
Methods: We evaluated the accuracy of the approach using DistillerAI as a semi-automated screening tool. A published comparative effectiveness review served as the reference standard. Five teams of professional systematic reviewers screened the same 2472 abstracts in parallel. Each team trained DistillerAI with 300 randomly selected abstracts that the team screened dually. For all remaining abstracts, DistillerAI replaced one human screener and provided predictions about the relevance of records. A single reviewer also screened all remaining abstracts. A second human screener resolved conflicts between the single reviewer and DistillerAI. We compared the decisions of the machine-assisted approach, single-reviewer screening, and screening with DistillerAI alone against the reference standard.
Results: The combined sensitivity of the machine-assisted screening approach across the five screening teams was 78% (95% confidence interval [CI], 66 to 90%), and the combined specificity was 95% (95% CI, 92 to 97%). By comparison, the sensitivity of single-reviewer screening was similar (78%; 95% CI, 66 to 89%); however, the sensitivity of DistillerAI alone was substantially worse (14%; 95% CI, 0 to 31%) than that of the machine-assisted screening approach. Specificities for single-reviewer screening and DistillerAI were 94% (95% CI, 91 to 97%) and 98% (95% CI, 97 to 100%), respectively. Machine-assisted screening and single-reviewer screening had similar areas under the curve (0.87 and 0.86, respectively); by contrast, the area under the curve for DistillerAI alone was only slightly better than chance (0.56). The interrater agreement between human screeners and DistillerAI, measured with a prevalence-adjusted kappa, was 0.85 (95% CI, 0.84 to 0.86).
Conclusions: The accuracy of DistillerAI is not yet adequate to replace a human screener temporarily during abstract screening for systematic reviews. Rapid reviews, which do not require detecting the totality of the relevant evidence, may find semi-automation tools to have greater utility than traditional systematic reviews.
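For readers wanting to trace the arithmetic, the headline metrics reduce to simple functions of a 2x2 table against the reference standard, and the prevalence-adjusted kappa reduces to 2 * p_o - 1, where p_o is the observed agreement. A minimal Python sketch, using hypothetical counts chosen only to land near the reported values:

```python
# Minimal sketch of the accuracy metrics reported above; the 2x2 counts
# are illustrative placeholders, not the study's data.

def sensitivity(tp: int, fn: int) -> float:
    """Proportion of truly relevant records that were retained."""
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    """Proportion of truly irrelevant records that were excluded."""
    return tn / (tn + fp)

def pabak(agreements: int, total: int) -> float:
    """Prevalence-adjusted kappa reduces to 2 * p_o - 1, where p_o is
    the observed proportion of agreement between the two raters."""
    return 2 * (agreements / total) - 1

# Hypothetical 2x2 counts against the reference standard:
tp, fn, tn, fp = 47, 13, 2290, 122
print(f"sensitivity = {sensitivity(tp, fn):.2f}")   # 0.78
print(f"specificity = {specificity(tn, fp):.2f}")   # 0.95
print(f"PABAK       = {pabak(2287, 2472):.2f}")     # 0.85
```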

2020 ◽  
Vol 20 (1) ◽  
Author(s):  
C. Hamel ◽  
S. E. Kelly ◽  
K. Thavorn ◽  
D. B. Rice ◽  
G. A. Wells ◽  
...  

Abstract
Background: Systematic reviews often require substantial resources, partially due to the large number of records identified during searching. Although artificial intelligence may not be ready to fully replace human reviewers, it may accelerate and reduce the screening burden. Using DistillerSR (May 2020 release), we evaluated the performance of the prioritization simulation tool to determine the reduction in screening burden and time savings.
Methods: Response sets from 10 completed systematic reviews were used to evaluate, at a true recall @ 95%: (i) the reduction of screening burden; (ii) the accuracy of the prioritization algorithm; and (iii) the hours saved when a modified screening approach was implemented. To account for variation in the simulations, and to introduce randomness (through shuffling the references), 10 simulations were run for each review. Means, standard deviations, medians and interquartile ranges (IQR) are presented.
Results: Across the 10 systematic reviews, at true recall @ 95% there was a median reduction in screening burden of 47.1% (IQR: 37.5 to 58.0%). A median of 41.2% (IQR: 33.4 to 46.9%) of the excluded records needed to be screened to achieve true recall @ 95%. The median title/abstract screening hours saved using a modified screening approach at a true recall @ 95% was 29.8 h (IQR: 28.1 to 74.7 h). This increased to a median of 36 h (IQR: 32.2 to 79.7 h) when considering the time saved by not retrieving and screening the full texts of the remaining 5% of records not yet identified as included at title/abstract. Across the 100 simulations (10 per review), none of these remaining 5% of records was a final included study in the systematic review. Stopping at true recall @ 95% rather than @ 100% yielded a median reduction in screening burden of 40.6% (IQR: 38.3 to 54.2%).
Conclusions: The prioritization tool in DistillerSR can reduce screening burden. A modified or stop-screening approach once a true recall @ 95% is achieved appears to be a valid method for rapid reviews, and perhaps systematic reviews. This needs to be further evaluated in prospective reviews using the estimated recall.
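The "true recall @ 95%" stopping rule can be made concrete with a short simulation: rank records by a prioritization score and count how far down the ranking one must screen before 95% of the truly included records have appeared. A hedged sketch with simulated labels and scores (DistillerSR's actual prioritization algorithm is not reproduced here):

```python
# Sketch of screening burden at "true recall @ 95%" from a prioritized
# record ordering; the ranking and labels are simulated placeholders.
import math
import random

def burden_at_recall(ranked_labels: list[int], target_recall: float = 0.95):
    """Return (records screened, fraction of total) needed so that
    target_recall of all truly included records have been seen."""
    total_includes = sum(ranked_labels)
    needed = math.ceil(target_recall * total_includes)
    seen = 0
    for i, label in enumerate(ranked_labels, start=1):
        seen += label
        if seen >= needed:
            return i, i / len(ranked_labels)
    return len(ranked_labels), 1.0

# Simulated review: 2,000 records, 60 includes, imperfectly front-loaded
# by a hypothetical prioritization score.
random.seed(7)
labels = [1] * 60 + [0] * 1940
scores = [random.gauss(1.0 if y else 0.0, 0.8) for y in labels]
ranked = [y for _, y in sorted(zip(scores, labels), key=lambda t: -t[0])]

screened, fraction = burden_at_recall(ranked)
print(f"screened {screened} of {len(ranked)} records ({fraction:.1%}); "
      f"burden reduction ~{1 - fraction:.1%}")
```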


2020 ◽  
Vol 9 (1) ◽  
Author(s):  
Allison Gates ◽  
Michelle Gates ◽  
Daniel DaRosa ◽  
Sarah A. Elliott ◽  
Jennifer Pillay ◽  
...  

Abstract
Background: We evaluated the benefits and risks of using the Abstrackr machine learning (ML) tool to semi-automate title-abstract screening and explored whether Abstrackr’s predictions varied by review or study-level characteristics.
Methods: For a convenience sample of 16 reviews for which adequate data were available to address our objectives (11 systematic reviews and 5 rapid reviews), we screened a 200-record training set in Abstrackr and downloaded the relevance (relevant or irrelevant) of the remaining records, as predicted by the tool. We retrospectively simulated the liberal-accelerated screening approach. We estimated the time savings and proportion missed compared with dual independent screening. For reviews with pairwise meta-analyses, we evaluated changes to the pooled effects after removing the missed studies. We explored whether the tool’s predictions varied by review and study-level characteristics.
Results: Using the ML-assisted liberal-accelerated approach, we wrongly excluded 0 to 3 (0 to 14%) records that were included in the final reports, but saved a median (IQR) 26 (9, 42) h of screening time. One missed study was included in eight pairwise meta-analyses in one systematic review. The pooled effect for just one of those meta-analyses changed considerably (from MD (95% CI) − 1.53 (− 2.92, − 0.15) to − 1.17 (− 2.70, 0.36)). Of 802 records in the final reports, 87% were correctly predicted as relevant. The correctness of the predictions did not differ by review type (systematic or rapid, P = 0.37) or intervention type (simple or complex, P = 0.47). The predictions were more often correct in reviews with multiple (89%) vs. single (83%) research questions (P = 0.01), or that included only trials (95%) vs. multiple designs (86%) (P = 0.003). At the study level, trials (91%), mixed methods (100%), and qualitative (93%) studies were more often correctly predicted as relevant compared with observational studies (79%) or reviews (83%) (P = 0.0006). Studies at high or unclear (88%) vs. low risk of bias (80%) (P = 0.039), and those published more recently (mean (SD) 2008 (7) vs. 2006 (10), P = 0.02), were more often correctly predicted as relevant.
Conclusion: Our screening approach saved time and may be suitable in conditions where the limited risk of missing relevant records is acceptable. Several of our findings are paradoxical and require further study to fully understand the tasks to which ML-assisted screening is best suited. The findings should be interpreted in light of the fact that the protocol was prepared for the funder but not published a priori. Because we used a convenience sample, the findings may be prone to selection bias. The results may not be generalizable to other samples of reviews, ML tools, or screening approaches. The small number of missed studies across reviews with pairwise meta-analyses hindered strong conclusions about the effect of missed studies on the results and conclusions of systematic reviews.
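The ML-assisted liberal-accelerated rule being simulated is compact enough to state in code: a record advances to full text if either the single human screener or the tool's prediction marks it relevant, and a record is "wrongly excluded" if it appears in the final report but both votes rejected it. A toy sketch under those assumptions, with illustrative vectors rather than study data:

```python
# Hedged sketch of the ML-assisted liberal-accelerated screening rule;
# all input vectors are illustrative placeholders.

def liberal_accelerated(human: list[bool], ml: list[bool]) -> list[bool]:
    """Advance a record to full text if either vote is 'relevant'."""
    return [h or m for h, m in zip(human, ml)]

def wrongly_excluded(advance: list[bool], final_includes: list[bool]) -> int:
    """Count records in the final report that the approach screened out."""
    return sum(f and not a for a, f in zip(advance, final_includes))

# Ten-record toy example:
human = [True, False, False, True, False, False, True, False, False, False]
ml    = [True, True,  False, False, False, False, True, True,  False, False]
truth = [True, True,  True,  True,  False, False, True, False, False, False]

advance = liberal_accelerated(human, ml)
print("advanced:", sum(advance),
      "wrongly excluded:", wrongly_excluded(advance, truth))
```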


Sensors ◽  
2021 ◽  
Vol 21 (3) ◽  
pp. 1012
Author(s):  
Jisu Hwang ◽  
Incheol Kim

Due to the development of computer vision and natural language processing technologies in recent years, there has been growing interest in multimodal intelligent tasks that require concurrently understanding various forms of input data, such as images and text. Vision-and-language navigation (VLN) requires the alignment and grounding of multimodal input data for real-time perception of the task status from panoramic images and natural language instructions. This study proposes JMEBS, a novel deep neural network model with joint multimodal embedding and backtracking search for VLN tasks. The proposed model uses a transformer-based joint multimodal embedding module that exploits both multimodal and temporal context. It also employs backtracking-enabled greedy local search (BGLS), a novel algorithm with a backtracking feature designed to improve the task success rate and optimize the navigation path based on the local and global scores of candidate actions. A novel global scoring method further improves performance by comparing the partial trajectories searched thus far with multiple natural language instructions. The performance of the proposed model was experimentally demonstrated and compared with other models on the Matterport3D Simulator and room-to-room (R2R) benchmark datasets.
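The abstract gives only a high-level description of BGLS, so the following is a schematic sketch of the general idea, not the authors' implementation: the agent greedily takes the locally best candidate action, rejects extensions that lower a global trajectory score, and backtracks when a state offers no untried candidates. The hooks (candidates, local_score, global_score, goal) are placeholders standing in for the model's learned components:

```python
# Schematic backtracking-enabled greedy local search; an assumed sketch
# of the idea only, not the JMEBS authors' code.

def bgls(start, candidates, local_score, global_score, goal, max_steps=50):
    """candidates(state) -> iterable of next states; local_score and
    global_score are assumed scoring hooks; goal(state) tests success."""
    trajectory = [start]
    tried = [set()]   # tried[i]: actions already attempted from trajectory[i]
    while trajectory and len(trajectory) <= max_steps:
        state = trajectory[-1]
        if goal(state):
            return trajectory
        options = [c for c in candidates(state) if c not in tried[-1]]
        if not options:            # dead end: backtrack one step
            trajectory.pop()
            tried.pop()
            continue
        best = max(options, key=lambda c: local_score(state, c))
        tried[-1].add(best)
        # Extend only if the whole partial trajectory still scores well
        # globally; otherwise stay and try the next-best local candidate.
        if global_score(trajectory + [best]) >= global_score(trajectory):
            trajectory.append(best)
            tried.append(set())
    return None                    # budget exhausted or search space empty
```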


2020 ◽  
Author(s):  
Christopher A Hane ◽  
Vijay S Nori ◽  
William H Crown ◽  
Darshak M Sanghavi ◽  
Paul Bleicher

BACKGROUND: Clinical trials need efficient tools to assist in recruiting patients at risk of Alzheimer disease and related dementias (ADRD). Early detection can also assist patients with financial planning for long-term care. Clinical notes are an important source of information for machine learning models, but they remain underutilized because of the cost of collection and the complexity of analysis.
OBJECTIVE: This study aimed to investigate the use of deidentified clinical notes from multiple hospital systems, collected over 10 years, to augment retrospective machine learning models of the risk of developing ADRD.
METHODS: We used 2 years of data to predict the future outcome of ADRD onset. Clinical notes are provided in a deidentified format with specific terms and sentiments. Terms in clinical notes are embedded into a 100-dimensional vector space to identify clusters of related terms and abbreviations that differ across hospital systems and individual clinicians.
RESULTS: When using clinical notes, the area under the curve (AUC) improved from 0.85 to 0.94, and the positive predictive value (PPV) increased from 45.07% (25,245/56,018) to 68.32% (14,153/20,717) in the model at disease onset. Models with clinical notes improved in both AUC and PPV in years 3-6, when note volume was largest; results were mixed in years 7 and 8, which had the smallest cohorts.
CONCLUSIONS: Although clinical notes helped in the short term, the presence of ADRD symptomatic terms years before onset adds evidence to other studies suggesting that clinicians undercode diagnoses of ADRD. Deidentified clinical notes increase the accuracy of risk models. Clinical notes processed with natural language processing across multiple hospital systems can be merged using postprocessing techniques to aid model accuracy.
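A plausible rendering of the embedding step, assumed rather than taken from the authors' pipeline: train a 100-dimensional word embedding over tokenized, deidentified notes, then cluster the vocabulary vectors so that related terms and abbreviations, which vary across hospital systems and clinicians, fall together. A sketch using gensim and scikit-learn with a placeholder corpus:

```python
# Assumed sketch of note-term embedding and clustering; library choice
# and `tokenized_notes` are placeholders, not the authors' pipeline.
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

tokenized_notes = [
    ["pt", "alert", "oriented", "x3", "no", "memory", "deficit"],
    ["patient", "confused", "memory", "loss", "mmse", "21"],
    # ... one token list per deidentified note
]

# 100-dimensional embedding, as described in the abstract.
model = Word2Vec(tokenized_notes, vector_size=100, window=5, min_count=1)

# Cluster term vectors so that site- or clinician-specific abbreviations
# (e.g., "pt" vs. "patient") can land in the same cluster.
terms = model.wv.index_to_key
kmeans = KMeans(n_clusters=2, n_init=10).fit(model.wv[terms])
for term, label in zip(terms, kmeans.labels_):
    print(label, term)
```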


BMC Cancer ◽  
2022 ◽  
Vol 22 (1) ◽  
Author(s):  
Junren Kang ◽  
Hailong Li ◽  
Xiaodong Shi ◽  
Enling Ma ◽  
Wei Chen

Abstract
Background: Malnutrition is common in cancer patients. The NUTRISCORE is a newly developed cancer-specific nutritional screening tool, validated in Spain by comparison with the Patient-Generated Subjective Global Assessment (PG-SGA) and the Malnutrition Screening Tool (MST). We aimed to evaluate the performance of the NUTRISCORE, MST, and PG-SGA in estimating the risk of malnutrition in Chinese cancer patients.
Methods: Data from an open, parallel, multicenter cross-sectional study in 29 clinical teaching hospitals in 14 Chinese cities were used. Cancer patients were assessed for malnutrition using the PG-SGA, NUTRISCORE, and MST. The sensitivity, specificity, and areas under the receiver operating characteristic curve were estimated for the NUTRISCORE and MST using the PG-SGA as a reference.
Results: A total of 1000 cancer patients were included. The mean age was 55.9 years (range, 19 to 92), and 47.5% were male. Of these patients, 450 (45.0%) had PG-SGA ratings of B or C, 29 (2.9%) had a NUTRISCORE ≥ 5, and 367 (36.7%) had an MST ≥ 2. Using the PG-SGA as a reference, the sensitivity, specificity, and area under the curve values of the NUTRISCORE were 6.2%, 99.8%, and 0.53, respectively. The sensitivity, specificity, and area under the curve values of the MST were 50.9%, 74.9%, and 0.63, respectively. The kappa index between the NUTRISCORE and PG-SGA was 0.066, and that between the MST and PG-SGA was 0.262 (P < 0.05).
Conclusions: The NUTRISCORE had an extremely low sensitivity in cancer patients in China compared with the MST when the PG-SGA was used as a reference.
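The head-to-head comparison against the PG-SGA reference amounts to a confusion matrix and an agreement statistic per tool. A small sketch with scikit-learn; the label vectors are illustrative stand-ins, not the study's 1000-patient data:

```python
# Illustrative sketch of screening-tool evaluation against a reference
# standard; 1 = malnourished per each instrument's cut-off.
from sklearn.metrics import cohen_kappa_score, confusion_matrix, roc_auc_score

pg_sga     = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # reference: PG-SGA B or C
nutriscore = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # NUTRISCORE >= 5
mst        = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]  # MST >= 2

for name, tool in [("NUTRISCORE", nutriscore), ("MST", mst)]:
    tn, fp, fn, tp = confusion_matrix(pg_sga, tool).ravel()
    print(f"{name}: sens={tp / (tp + fn):.2f} spec={tn / (tn + fp):.2f} "
          f"kappa={cohen_kappa_score(pg_sga, tool):.2f} "
          f"auc={roc_auc_score(pg_sga, tool):.2f}")
```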


2019 ◽  
Vol 23 (40) ◽  
pp. 1-194 ◽  
Author(s):  
Alasdair MJ MacLullich ◽  
Susan D Shenkin ◽  
Steve Goodacre ◽  
Mary Godfrey ◽  
Janet Hanley ◽  
...  

Background: Delirium is a common and serious neuropsychiatric syndrome, usually triggered by illness or drugs. It remains underdetected. One reason for this is a lack of brief, pragmatic assessment tools. The 4 ‘A’s test (Arousal, Attention, Abbreviated Mental Test – 4, Acute change) (4AT) is a screening tool designed for routine use. This project evaluated its usability, diagnostic accuracy and cost.
Methods: Phase 1 – the usability of the 4AT in routine practice was measured with two surveys and two qualitative studies of health-care professionals, and a review of current clinical use of the 4AT as well as its presence in guidelines and reports. Phase 2 – the 4AT’s diagnostic accuracy was assessed in newly admitted acute medical patients aged ≥ 70 years. Its performance was compared with that of the Confusion Assessment Method (CAM; a longer screening tool). The performance of individual 4AT test items was related to cognitive status, length of stay, new institutionalisation, mortality at 12 weeks and other outcomes. The method used was a prospective, double-blind diagnostic test accuracy study in emergency departments or in acute general medical wards at three UK sites. Each patient underwent a reference standard delirium assessment and was also randomised to receive an assessment with either the 4AT (n = 421) or the CAM (n = 420). A health economics analysis was also conducted.
Results: Phase 1 found evidence that delirium awareness is increasing, but also that there is a need for education on delirium in general and on the 4AT in particular. Most users reported that the 4AT was useful, and it was in widespread use both in the UK and beyond. No changes to the 4AT were considered necessary. Phase 2 involved 785 individuals with data for analysis; their mean age was 81.4 (standard deviation 6.4) years, 45% were male, 99% were white and 9% had a known dementia diagnosis. The 4AT (n = 392) had an area under the receiver operating characteristic curve of 0.90. A positive 4AT score (> 3) had a specificity of 95% [95% confidence interval (CI) 92% to 97%] and a sensitivity of 76% (95% CI 61% to 87%) for reference standard delirium. The CAM (n = 382) had a specificity of 100% (95% CI 98% to 100%) and a sensitivity of 40% (95% CI 26% to 57%) in the subset of participants whom it was possible to assess with it. Patients with positive 4AT scores had longer lengths of stay (median 5 days, interquartile range 2.0–14.0 days) than those with negative 4AT scores (median 2 days, interquartile range 1.0–6.0 days), and they had a higher 12-week mortality rate (16.1% vs. 9.2%). The estimated 12-week costs of an initial inpatient stay for patients with delirium were more than double those for patients without delirium (e.g. in Scotland, £7559, 95% CI £7362 to £7755, vs. £4215, 95% CI £4175 to £4254). The estimated cost of false-positive cases was £4653, of false-negative cases £8956, and of a missed diagnosis £2067.
Limitations: Patients were aged ≥ 70 years and were assessed soon after admission, limiting generalisability. The treatment of patients in accordance with the reference standard diagnosis limited the ability to assess comparative cost-effectiveness.
Conclusions: These findings support the use of the 4AT as a rapid delirium assessment instrument. The 4AT has acceptable diagnostic accuracy for acute patients aged ≥ 70 years.
Future work: Further research should address the real-world implementation of delirium assessment. The 4AT should be tested in other populations.
Trial registration: Current Controlled Trials ISRCTN53388093.
Funding: This project was funded by the National Institute for Health Research (NIHR) Health Technology Assessment programme and will be published in full in Health Technology Assessment; Vol. 23, No. 40. See the NIHR Journals Library website for further project information. The funder specified that any new delirium assessment tool should be compared against the CAM, but had no other role in the design or conduct of the study.
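The paired point estimates and 95% CIs quoted for the 4AT can be reproduced from 2x2 counts with a standard binomial interval. A sketch using statsmodels, with hypothetical counts chosen only to land near the reported values:

```python
# Sketch of sensitivity/specificity with Wilson 95% CIs; the 2x2 counts
# are illustrative, not the trial's actual data.
from statsmodels.stats.proportion import proportion_confint

def with_ci(successes: int, total: int):
    """Point estimate plus Wilson-score 95% confidence interval."""
    lo, hi = proportion_confint(successes, total, alpha=0.05, method="wilson")
    return successes / total, lo, hi

sens, s_lo, s_hi = with_ci(38, 50)    # hypothetical TP / (TP + FN)
spec, p_lo, p_hi = with_ci(325, 342)  # hypothetical TN / (TN + FP)
print(f"4AT > 3: sensitivity {sens:.0%} (95% CI {s_lo:.0%} to {s_hi:.0%})")
print(f"4AT > 3: specificity {spec:.0%} (95% CI {p_lo:.0%} to {p_hi:.0%})")
```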


2021 ◽  
Vol 8 ◽  
Author(s):  
Michiel Delesie ◽  
Lieselotte Knaepen ◽  
Johan Verbraecken ◽  
Karolien Weytjens ◽  
Paul Dendale ◽  
...  

Background: Obstructive sleep apnea (OSA) is a modifiable risk factor for atrial fibrillation (AF) but is underdiagnosed in these patients due to the absence of good OSA screening pathways. Polysomnography (PSG) is the gold standard for diagnosing OSA but is too resource-intensive to serve as a screening tool. We explored whether cardiorespiratory polygraphy (PG) devices using an automated algorithm for Apnea-Hypopnea Index (AHI) determination can meet the requirements of a good screening tool in AF patients.
Methods: This prospective study validated the performance of three PGs [ApneaLink Air (ALA), SOMNOtouch RESP (STR) and SpiderSAS (SpS)] in consecutive AF patients who were referred for PSG evaluation. Patients wore one of the three PGs simultaneously with PSG, and a different PG during each of three consecutive nights at home. Severity of OSA was classified according to the AHI during PSG (<5 = no OSA, 5–14 = mild, 15–30 = moderate, >30 = severe).
Results: Of the 100 included AF patients, PSG diagnosed at least moderate OSA in 69% and severe OSA in 33%. Successful PG execution at home was obtained in 79.1%, 80.2% and 86.8% of patients with the ALA, STR and SpS, respectively. For the detection of clinically relevant OSA (AHI ≥ 15), areas under the curve of 0.802, 0.772 and 0.803 were calculated for the ALA, STR and SpS, respectively.
Conclusions: This study indicates that home-worn PGs with an automated AHI algorithm can be used as OSA screening tools in AF patients. Based on an appropriate AHI cut-off value for each PG, the device can guide referral for definitive PSG diagnosis.
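Two steps of this analysis are easy to express directly: grading OSA severity from an AHI value using the cut-offs above, and scoring a device's automated AHI against the PSG reference for detecting clinically relevant OSA (AHI >= 15). A sketch with simulated placeholder measurements:

```python
# Sketch of AHI-based severity grading and device-vs-PSG AUC; the AHI
# values below are simulated placeholders, not study measurements.
from sklearn.metrics import roc_auc_score

def osa_severity(ahi: float) -> str:
    """Grade OSA severity using the PSG cut-offs from the study."""
    if ahi < 5:
        return "no OSA"
    if ahi < 15:
        return "mild"
    if ahi <= 30:
        return "moderate"
    return "severe"

psg_ahi = [3.2, 8.5, 17.0, 26.4, 41.9, 12.1, 33.0, 4.8]  # PSG reference
pg_ahi  = [4.0, 7.1, 14.2, 29.8, 38.5, 16.0, 27.9, 6.3]  # device estimate

relevant = [a >= 15 for a in psg_ahi]  # clinically relevant OSA per PSG
print("severities:", [osa_severity(a) for a in psg_ahi])
print("AUC:", round(roc_auc_score(relevant, pg_ahi), 3))
```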


2017 ◽  
Vol 30 (4) ◽  
pp. 538-564 ◽  
Author(s):  
Grant Duwe

This study examines the development and validation of the Minnesota Sex Offender Screening Tool–4 (MnSOST-4) on a dataset consisting of 5,745 sex offenders released from Minnesota prisons between 2003 and 2012. Bootstrap resampling was used to select predictors, and k-fold and split-sample methods were used to internally validate the MnSOST-4. Using sex offense reconviction within 4 years of release from prison as the failure criterion, the data showed that 130 (2.3%) offenders in the overall sample were recidivists. Multiple classification methods and performance metrics were used to develop the MnSOST-4 and evaluate its predictive performance on the test set. The results from the regularized logistic regression algorithm showed that the MnSOST-4 performed well in predicting sexual recidivism in the test set, achieving an area under the curve (AUC) of 0.835. Additional analyses on the test set revealed that the MnSOST-4 outperformed the Minnesota Sex Offender Screening Tool–3 (MnSOST-3), Minnesota Sex Offender Screening Tool–Revised (MnSOST-R), and Static-99 in predicting sexual reoffending.
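The validation recipe the abstract outlines (regularized logistic regression, k-fold cross-validation, a held-out test split, AUC as the metric) maps directly onto scikit-learn. A hedged sketch on synthetic data with the reported ~2.3% base rate, since the offender records themselves are not public:

```python
# Sketch of the internal-validation recipe on synthetic data; the
# features and labels stand in for the (non-public) MnSOST-4 dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import roc_auc_score

# ~2.3% recidivism base rate among 5,745 cases, as reported above.
X, y = make_classification(n_samples=5745, n_features=20,
                           weights=[0.977], random_state=0)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
cv_auc = cross_val_score(clf, X_tr, y_tr, cv=5, scoring="roc_auc")
print("5-fold CV AUC:", np.round(cv_auc.mean(), 3))

clf.fit(X_tr, y_tr)
test_auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print("held-out AUC:", round(test_auc, 3))
```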

