Decoding semi-automated title-abstract screening: a retrospective exploration of the review, study, and publication characteristics associated with accurate relevance predictions

2020 ◽  
Vol 9 (1) ◽  
Author(s):  
Allison Gates ◽  
Michelle Gates ◽  
Daniel DaRosa ◽  
Sarah A. Elliott ◽  
Jennifer Pillay ◽  
...  

Abstract Background We evaluated the benefits and risks of using the Abstrackr machine learning (ML) tool to semi-automate title-abstract screening and explored whether Abstrackr’s predictions varied by review or study-level characteristics. Methods For a convenience sample of 16 reviews for which adequate data were available to address our objectives (11 systematic reviews and 5 rapid reviews), we screened a 200-record training set in Abstrackr and downloaded the relevance (relevant or irrelevant) of the remaining records, as predicted by the tool. We retrospectively simulated the liberal-accelerated screening approach: one reviewer screened the records predicted as relevant; a second reviewer screened those predicted as irrelevant and those excluded by the first reviewer. We estimated the time savings and proportion missed compared with dual independent screening. For reviews with pairwise meta-analyses, we evaluated changes to the pooled effects after removing the missed studies. We explored whether the tool’s predictions varied by review and study-level characteristics using Fisher’s Exact and unpaired t-tests. Results Using the ML-assisted liberal-accelerated approach, we wrongly excluded 0 to 3 (0 to 14%) records that were included in the final reports, but saved a median (IQR) 26 (9, 42) h of screening time. One missed study was included in eight pairwise meta-analyses in one systematic review. The pooled effect for just one of those meta-analyses changed considerably (from MD (95% CI) −1.53 (−2.92, −0.15) to −1.17 (−2.70, 0.36)). Of 802 records in the final reports, 87% were correctly predicted as relevant. The correctness of the predictions did not differ by review type (systematic or rapid, P = 0.37) or intervention type (simple or complex, P = 0.47). The predictions were more often correct in reviews with multiple (89%) vs. single (83%) research questions (P = 0.01), or that included only trials (95%) vs. multiple designs (86%) (P = 0.003). At the study level, trials (91%), mixed methods (100%), and qualitative (93%) studies were more often correctly predicted as relevant compared with observational studies (79%) or reviews (83%) (P = 0.0006). Studies at high or unclear (88%) vs. low risk of bias (80%) (P = 0.039), and those published more recently (mean (SD) 2008 (7) vs. 2006 (10), P = 0.02) were more often correctly predicted as relevant. Conclusion Our screening approach saved time and may be suitable in conditions where the limited risk of missing relevant records is acceptable. Several of our findings are paradoxical and require further study to fully understand the tasks to which ML-assisted screening is best suited. The findings should be interpreted in light of the fact that the protocol was prepared for the funder but not published a priori. Because we used a convenience sample, the findings may be prone to selection bias. The results may not be generalizable to other samples of reviews, ML tools, or screening approaches. The small number of missed studies across reviews with pairwise meta-analyses hindered strong conclusions about the effect of missed studies on the results and conclusions of systematic reviews.
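
Comparisons like those above (correct vs. incorrect predictions across review or study characteristics) reduce to a 2x2 Fisher's exact test for proportions and an unpaired t-test for publication year. Below is a minimal Python sketch with scipy, using made-up counts and years chosen only to mirror the reported percentages; these are not the study's underlying data.

    # Illustrative re-analysis of prediction correctness; all counts are hypothetical.
    from scipy.stats import fisher_exact, ttest_ind

    # Hypothetical 2x2 table: rows = review scope (trials only vs. multiple designs),
    # columns = predictions (correct, incorrect). Chosen to give ~95% vs. ~86% correct.
    table = [[190, 10],
             [516, 84]]
    odds_ratio, p_value = fisher_exact(table)
    print(f"Fisher's exact test: OR={odds_ratio:.2f}, P={p_value:.4f}")

    # Hypothetical publication years of correctly vs. incorrectly predicted records.
    years_correct = [2001, 2005, 2008, 2010, 2012, 2014, 2015]
    years_incorrect = [1996, 2000, 2004, 2006, 2010, 2012]
    t_stat, p_year = ttest_ind(years_correct, years_incorrect, equal_var=True)
    print(f"Unpaired t-test on publication year: t={t_stat:.2f}, P={p_year:.4f}")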


2021 ◽  
Vol 10 (1) ◽  
pp. 96
Author(s):  
Saeid Eslami ◽  
Raheleh Ganjali

Introduction: On March 20, 2020, the World Health Organization (WHO) declared the spread of SARS-CoV-2 infection across most countries worldwide a pandemic. COVID-19 is disseminated mainly through human-to-human transmission via direct contact and respiratory droplets. Telehealth and telemedicine technologies are useful tools for dealing with pandemics of communicable infections. The purpose of this proposed systematic review is to summarize the functionalities, applications, and technologies of telemedicine during the COVID-19 outbreak. Material and Methods: This review will be carried out in accordance with the Cochrane Handbook and the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) reporting guidelines. The PubMed and Scopus databases were searched for related articles. Randomized and non-randomized controlled trials published in English in scientific journals were assessed for eligibility. Articles on telemedicine services (TMS) during the COVID-19 outbreak (2019-2020) were identified for evaluation. Results: The literature search in the PubMed and Scopus databases identified and retrieved a total of 1118 and 485 articles, respectively. After duplicate articles were eliminated, title and abstract screening was performed for the remaining 1440 articles. The findings of the current study are anticipated to serve as a guide for researchers, decision makers, and managers to design, implement, and assess TMS during the COVID-19 crisis. Conclusion: To our knowledge, this systematic review comprehensively evaluates the telemedicine methods and technologies developed to control and manage the COVID-19 pandemic. This study highlights important applications of telemedicine in pandemic conditions, which future health systems could employ to control and manage communicable infections when an outbreak occurs.
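
The counts above imply 1603 records retrieved in total (1118 + 485) and 163 duplicates removed before the 1440-record title and abstract screen. Below is a minimal sketch of such a deduplication step, assuming hypothetical record dictionaries keyed by DOI or a normalized title; it is not the authors' actual reference-management workflow.

    # Minimal record deduplication sketch; the record fields are hypothetical.
    def normalize(title: str) -> str:
        """Lowercase and strip non-alphanumerics so near-identical titles match."""
        return "".join(ch for ch in title.lower() if ch.isalnum())

    def deduplicate(records):
        """Keep the first record seen for each DOI (or normalized title if no DOI)."""
        seen, unique = set(), []
        for rec in records:
            key = rec.get("doi") or normalize(rec.get("title", ""))
            if key and key not in seen:
                seen.add(key)
                unique.append(rec)
        return unique

    pubmed = [{"doi": "10.1000/a1", "title": "Telehealth during COVID-19"}]   # 1118 in the review
    scopus = [{"doi": "10.1000/a1", "title": "Telehealth During COVID-19."}]  # 485 in the review
    merged = deduplicate(pubmed + scopus)
    print(len(merged))  # the review retained 1440 unique records (163 duplicates removed)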


2020 ◽  
Vol 20 (1) ◽  
Author(s):  
C. Hamel ◽  
S. E. Kelly ◽  
K. Thavorn ◽  
D. B. Rice ◽  
G. A. Wells ◽  
...  

Abstract Background Systematic reviews often require substantial resources, partly because of the large number of records identified during searching. Although artificial intelligence may not be ready to fully replace human reviewers, it may accelerate screening and reduce the screening burden. Using DistillerSR (May 2020 release), we evaluated the performance of the prioritization simulation tool to determine the reduction in screening burden and time savings. Methods At a true recall @ 95%, response sets from 10 completed systematic reviews were used to evaluate: (i) the reduction of screening burden; (ii) the accuracy of the prioritization algorithm; and (iii) the hours saved when a modified screening approach was implemented. To account for variation in the simulations, and to introduce randomness (through shuffling the references), 10 simulations were run for each review. Means, standard deviations, medians and interquartile ranges (IQR) are presented. Results Among the 10 systematic reviews, at true recall @ 95% there was a median reduction in screening burden of 47.1% (IQR: 37.5 to 58.0%). A median of 41.2% (IQR: 33.4 to 46.9%) of the excluded records needed to be screened to achieve true recall @ 95%. The median title/abstract screening hours saved using a modified screening approach at a true recall @ 95% was 29.8 h (IQR: 28.1 to 74.7 h). This increased to a median of 36 h (IQR: 32.2 to 79.7 h) when considering the time saved by not retrieving and screening the full texts of the remaining 5% of records not yet identified as included at title/abstract. Among the 100 simulations (10 simulations per review), none of these remaining 5% of records was a final included study in the systematic review. Compared with screening to true recall @ 100%, screening to true recall @ 95% reduced the screening burden by a median of 40.6% (IQR: 38.3 to 54.2%). Conclusions The prioritization tool in DistillerSR can reduce the screening burden. A modified screening approach, or stopping screening once a true recall @ 95% is achieved, appears to be a valid method for rapid reviews, and perhaps systematic reviews. This needs to be further evaluated in prospective reviews using the estimated recall.
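
A "true recall @ 95%" stopping rule means screening the prioritized list only until 95% of the records that truly belong in the review have been seen; the untouched tail of the list is the screening-burden saving. Below is a minimal simulation sketch along those lines, using a hypothetical ranked list of inclusion labels rather than DistillerSR's actual prioritization output.

    # Sketch of a screening-burden calculation under a recall-based stopping rule.
    import random

    def screening_burden_at_recall(ranked_labels, target_recall=0.95):
        """Fraction of records that do NOT need manual screening once the
        target recall of truly included records has been reached."""
        total_includes = sum(ranked_labels)
        needed = target_recall * total_includes
        found = 0
        for i, label in enumerate(ranked_labels, start=1):
            found += label
            if found >= needed:
                return 1 - i / len(ranked_labels)
        return 0.0

    # Hypothetical review: 2000 records, 100 true includes, imperfectly prioritized.
    random.seed(1)
    labels = [1] * 100 + [0] * 1900
    ranked = sorted(labels, key=lambda x: -x + random.gauss(0, 0.8))  # noisy ranking
    print(f"Screening burden reduced by {screening_burden_at_recall(ranked):.1%}")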


2013 ◽  
Vol 12 (4) ◽  
pp. 157-169 ◽  
Author(s):  
Philip L. Roth ◽  
Allen I. Huffcutt

The topic of what interviews measure has received a great deal of attention over the years. One line of research has investigated the relationship between interviews and the construct of cognitive ability. A previous meta-analysis reported an overall corrected correlation of .40 (Huffcutt, Roth, & McDaniel, 1996). A more recent meta-analysis reported a noticeably lower corrected correlation of .27 (Berry, Sackett, & Landers, 2007). Our review of both meta-analyses suggests that the two studies posed different research questions. Further, a number of coding judgments in Berry et al. merit review, and there was no moderator analysis for educational versus employment interviews. As a result, we reanalyzed the work by Berry et al. and found a corrected correlation of .42 for employment interviews (.15 higher than Berry et al., a 56% increase). Further, educational interviews were associated with a corrected correlation of .21, supporting their influence as a moderator. We suggest that a better estimate of the correlation between employment interviews and cognitive ability is .42, and this takes us “back to the future” in that the better overall estimate of the employment interview–cognitive ability relationship is roughly .40. This difference has implications for what is being measured by interviews and for their incremental validity.
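
Corrections in this literature typically adjust observed interview-ability correlations for range restriction and for unreliability in the measures; the specific artifact values used by these meta-analyses are not given here. Below is a minimal Python sketch of two standard psychometric corrections (Thorndike Case II for range restriction and disattenuation for measurement error), with illustrative inputs that are not the authors' values.

    # Standard psychometric corrections, shown with illustrative numbers only.
    from math import sqrt

    def correct_for_unreliability(r_obs, criterion_reliability):
        """Disattenuate an observed correlation for measurement error in the criterion."""
        return r_obs / sqrt(criterion_reliability)

    def correct_for_range_restriction(r_obs, u):
        """Thorndike Case II correction, where u = unrestricted SD / restricted SD."""
        return (u * r_obs) / sqrt(1 + (u**2 - 1) * r_obs**2)

    # Illustrative values: observed r = .25, criterion reliability = .80, u = 1.3.
    r = 0.25
    r = correct_for_range_restriction(r, u=1.3)
    r = correct_for_unreliability(r, criterion_reliability=0.80)
    print(f"Corrected correlation: {r:.2f}")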


BMC Medicine ◽  
2021 ◽  
Vol 19 (1) ◽  
Author(s):  
Perrine Janiaud ◽  
Arnav Agarwal ◽  
Ioanna Tzoulaki ◽  
Evropi Theodoratou ◽  
Konstantinos K. Tsilidis ◽  
...  

Abstract Background The validity of observational studies and their meta-analyses is contested. Here, we aimed to appraise thousands of meta-analyses of observational studies using a pre-specified set of quantitative criteria that assess the significance, amount, consistency, and bias of the evidence. We also aimed to compare results from meta-analyses of observational studies against meta-analyses of randomized controlled trials (RCTs) and Mendelian randomization (MR) studies. Methods We retrieved from PubMed (last update, November 19, 2020) umbrella reviews including meta-analyses of observational studies assessing putative risk or protective factors, regardless of the nature of the exposure and health outcome. We extracted information on 7 quantitative criteria that reflect the level of statistical support, the amount of data, the consistency across different studies, and hints pointing to potential bias. These criteria were level of statistical significance (pre-categorized according to 10⁻⁶, 0.001, and 0.05 p-value thresholds), sample size, statistical significance for the largest study, 95% prediction intervals, between-study heterogeneity, and the results of tests for small study effects and for excess significance. Results 3744 associations (in 57 umbrella reviews) assessed by a median number of 7 (interquartile range 4 to 11) observational studies were eligible. Most associations were statistically significant at P < 0.05 (61.1%, 2289/3744). Only 2.6% of associations had P < 10⁻⁶, ≥1000 cases (or ≥20,000 participants for continuous factors), P < 0.05 in the largest study, a 95% prediction interval excluding the null, and no large between-study heterogeneity, small study effects, or excess significance. Across the 57 topics, large heterogeneity was observed in the proportion of associations fulfilling the various quantitative criteria. The quantitative criteria were mostly independent from one another. Across 62 associations assessed in both RCTs and observational studies, 37.1% had effect estimates in opposite directions and 43.5% had effect estimates differing beyond chance in the two designs. Across 94 comparisons assessed in both MR and observational studies, such discrepancies occurred in 30.8% and 54.7%, respectively. Conclusions Acknowledging that no gold standard exists to judge whether an observational association is genuine, statistically significant results are common in observational studies, but they are rarely convincing or corroborated by randomized evidence.
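
Among the quantitative criteria above, the 95% prediction interval is the one most often recomputed from summary data: under the usual t-approximation it is the pooled estimate plus or minus t with k - 2 degrees of freedom times the square root of tau-squared plus the squared standard error. Below is a minimal sketch with hypothetical inputs; the log-scale values are illustrative and not drawn from the umbrella reviews.

    # Approximate 95% prediction interval for a random-effects meta-analysis.
    from math import sqrt
    from scipy.stats import t

    def prediction_interval(mu, se, tau2, k, level=0.95):
        """t-approximation with k - 2 degrees of freedom for the effect in a new study."""
        if k < 3:
            raise ValueError("Need at least 3 studies for this approximation.")
        half_width = t.ppf(1 - (1 - level) / 2, df=k - 2) * sqrt(tau2 + se**2)
        return mu - half_width, mu + half_width

    # Hypothetical summary: pooled log relative risk 0.18 (SE 0.05), tau^2 = 0.04,
    # across k = 7 studies (the median number per association in the umbrella reviews).
    low, high = prediction_interval(mu=0.18, se=0.05, tau2=0.04, k=7)
    print(f"95% prediction interval: {low:.2f} to {high:.2f}")  # here it includes the null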


1985 ◽  
Vol 4 (4) ◽  
pp. 349-364 ◽  
Author(s):  
Roni Beth Tower

In a study of forty-three preschool children, ratings of four types of the children's imaginativeness were correlated with observational, behavioral, and interview measures. Research questions were: 1) Do correlates of imaginativeness found in observational studies replicate if trait rather than state measures are examined? 2) Do different types of imaginativeness have different correlates? and 3) What characteristics distinguish children at the maladaptive extremes of imaginativeness from those at more moderate levels? The conceptual and empirical utility of considering imaginativeness to have two dimensions, Expressive and Constructive, was demonstrated. While optimal levels of Constructive Imaginativeness correlated significantly with other indices of healthy child development, the correlations were fewer and tended to be weaker for Expressive Imaginativeness. The negative implication of extremes was documented.

