Automated detection of poor-quality data: case studies in healthcare

2021 ◽  
Vol 11 (1) ◽  
Author(s):  
M. A. Dakka ◽  
T. V. Nguyen ◽  
J. M. M. Hall ◽  
S. M. Diakiw ◽  
M. VerMilyea ◽  
...  

Abstract: The detection and removal of poor-quality data in a training set is crucial to achieve high-performing AI models. In healthcare, data can be inherently poor-quality due to uncertainty or subjectivity, but as is often the case, the requirement for data privacy restricts AI practitioners from accessing raw training data, meaning manual visual verification of private patient data is not possible. Here we describe a novel method for automated identification of poor-quality data, called Untrainable Data Cleansing. This method is shown to have numerous benefits including protection of private patient data; improvement in AI generalizability; reduction in time, cost, and data needed for training; all while offering a truer reporting of AI performance itself. Additionally, results show that Untrainable Data Cleansing could be useful as a triage tool to identify difficult clinical cases that may warrant in-depth evaluation or additional testing to support a diagnosis.


2006 ◽  
Vol 21 (1) ◽  
pp. 67-70 ◽  
Author(s):  
Brian H. Toby

Important Rietveld error indices are defined and discussed. It is shown that while smaller error index values indicate a better fit of a model to the data, wrong models with poor-quality data may exhibit smaller error index values than some superb models with very high-quality data.
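
As a rough illustration of the point about data quality, the sketch below computes three commonly used Rietveld indices: the weighted-profile R factor (R_wp), the expected R factor (R_exp) and the reduced chi-squared, taken here as (R_wp / R_exp)^2. The abstract does not list which indices it covers, so this choice is an assumption. The function names, the toy intensities and the use of Poisson counting statistics (sigma = sqrt(counts)) are likewise hypothetical and chosen only for the example.

```python
import math


def rietveld_indices(y_obs, y_calc, sigma, n_params):
    """Return (R_wp, R_exp, chi2) for observed and calculated intensities."""
    weights = [1.0 / s ** 2 for s in sigma]  # w_i = 1 / sigma_i^2
    misfit = sum(w * (yo - yc) ** 2 for w, yo, yc in zip(weights, y_obs, y_calc))
    total = sum(w * yo ** 2 for w, yo in zip(weights, y_obs))
    dof = len(y_obs) - n_params  # number of points minus refined parameters
    r_wp = math.sqrt(misfit / total)
    r_exp = math.sqrt(dof / total)
    chi2 = (r_wp / r_exp) ** 2
    return r_wp, r_exp, chi2


def toy_fit(scale):
    """Hypothetical 5-point pattern whose model is wrong by 5% everywhere."""
    y_obs = [scale * y for y in (100.0, 400.0, 900.0, 400.0, 100.0)]
    y_calc = [1.05 * y for y in y_obs]       # same relative misfit at any scale
    sigma = [math.sqrt(y) for y in y_obs]    # assumed Poisson counting errors
    return rietveld_indices(y_obs, y_calc, sigma, n_params=1)


low_counts = toy_fit(scale=1.0)      # poor counting statistics
high_counts = toy_fit(scale=100.0)   # 100 times more counts

print("low counts:  R_wp=%.3f R_exp=%.3f chi2=%.2f" % low_counts)
print("high counts: R_wp=%.3f R_exp=%.3f chi2=%.2f" % high_counts)
```

In this toy case the same 5% misfit gives an identical R_wp for both patterns, but the chi-squared of the low-count pattern is 100 times smaller, because larger uncertainties make the same model error look statistically less significant.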


2019 ◽  
pp. 23-34
Author(s):  
Harvey Goldstein ◽  
Ruth Gilbert

This chapter addresses data linkage, which is key to using big administrative datasets to improve the efficiency and equity of services and policies. These benefits need to be weighed against potential harms, concerns about which have mainly focussed on privacy. In this chapter we argue for the public and researchers to be alert also to other kinds of harms. These include misuses of big administrative data through poor quality data, misleading analyses, misinterpretation or misuse of findings, and restrictions limiting what questions can be asked and by whom, resulting in research not achieved and advances not made for the public benefit. Ensuring that big administrative data are validly used for public benefit requires increased transparency about who has access and whose access is denied, how data are processed, linked and analysed, and how analyses or algorithms are used in public and private services. Public benefits and especially trust require replicable analyses by many researchers, not just a few data controllers. Wider use of big data will be helped by establishing a number of safe data repositories, fully accessible to researchers and their tools, and independent of the current monopolies on data processing, linkage, enhancement and uses of data.


2017 ◽  
Vol 49 (4) ◽  
pp. 415-424 ◽  
Author(s):  
Susan WILL-WOLF ◽  
Sarah JOVAN ◽  
Michael C. AMACHER

Abstract: Lichen element content is a reliable indicator for relative air pollution load in research and monitoring programmes requiring both efficiency and representation of many sites. We tested the value of costly rigorous field and handling protocols for sample element analysis using five lichen species. No relaxation of rigour was supported; four relaxed protocols generated data significantly different from rigorous protocols for many of the 20 validated elements. Minimally restrictive site selection criteria gave quality data from 86% of 81 permanent plots in northern Midwest USA; more restrictive criteria would likely reduce indicator reliability. Use of trained non-specialist field collectors was supported when target species choice considers the lichen community context. Evernia mesomorpha, Flavoparmelia caperata and Physcia aipolia/stellaris were successful target species. Non-specialists were less successful at distinguishing Parmelia sulcata and Punctelia rudecta from lookalikes, leading to few samples and some poor quality data.


Geophysics ◽  
2008 ◽  
Vol 73 (2) ◽  
pp. E51-E57 ◽  
Author(s):  
Jack P. Dvorkin

Laboratory data supported by granular-medium and inclusion theories indicate that Poisson’s ratio in gas-saturated sand lies within a range of 0–0.25, with typical values of approximately 0.15. However, some well log measurements, especially in slow gas formations, persistently produce a Poisson’s ratio as large as 0.3. If this measurement is not caused by poor-quality data, three in situ situations — patchy saturation, subresolution thin layering, and elastic anisotropy — provide a plausible explanation. In the patchy saturation situation, the well data must be corrected to produce realistic synthetic seismic traces. In the second and third cases, the effect observed in a well is likely to persist at the seismic scale.
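
For orientation, the sketch below shows how Poisson's ratio is commonly derived from P- and S-wave velocities under the standard isotropic elastic relation, and how a value could be checked against the 0–0.25 range quoted above. The isotropic assumption, the function names and the velocity values are illustrative assumptions and are not taken from the paper.

```python
def poissons_ratio(vp, vs):
    """Poisson's ratio from P- and S-wave velocities (isotropic medium assumed)."""
    if vs <= 0 or vp <= vs:
        raise ValueError("expected vp > vs > 0")
    r2 = (vp / vs) ** 2  # squared Vp/Vs ratio
    return (r2 - 2.0) / (2.0 * (r2 - 1.0))


def outside_gas_sand_range(nu, low=0.0, high=0.25):
    """True if nu falls outside the 0-0.25 range quoted in the abstract."""
    return not (low <= nu <= high)


# Hypothetical log readings in m/s, chosen only to exercise the functions.
nu_a = poissons_ratio(vp=3000.0, vs=2000.0)
nu_b = poissons_ratio(vp=2800.0, vs=1500.0)

print(round(nu_a, 3), outside_gas_sand_range(nu_a))
print(round(nu_b, 3), outside_gas_sand_range(nu_b))
```

A reading flagged by such a check would then need the kind of interpretation the abstract describes: poor-quality data, patchy saturation, subresolution thin layering, or elastic anisotropy.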


10.28945/2584 ◽  
2002 ◽  
Author(s):  
Herna L. Viktor ◽  
Wayne Motha

Increasingly, large organizations are engaging in data warehousing projects in order to achieve a competitive advantage through the exploration of the information contained therein. It is therefore paramount to ensure that the data warehouse includes high quality data. However, practitioners agree that the improvement of the quality of data in an organization is a daunting task. This is especially evident in data warehousing projects, which are often initiated “after the fact”. The slightest suspicion of poor quality data often hinders managers from reaching decisions, as they waste hours in discussions to determine what portion of the data should be trusted. Augmenting data warehousing with data mining methods offers a mechanism to explore these vast repositories, enabling decision makers to assess the quality of their data and to unlock a wealth of new knowledge. These methods can be effectively used with the inconsistent, noisy and incomplete data that are commonplace in data warehouses.