Medical Analytics in the Presence of Human-Error. An Exploration of EMR Data Quality using MIMIC-III. (Preprint)

BACKGROUND Public Electronic Medical Records (EMR) datasets are a goldmine for vendors and researchers seeking to develop analytics designed to assist caregivers in monitoring, diagnosis, and treatment of patients. Both complex machine-learning-based tools, which require copious amounts of data to train, and a simple trend graph presented in a patient-centered dashboard, are sensitive to noise. OBJECTIVE We aim to systematically explore data errors in MIMIC-III as a representative of secondary use datasets and the impact of these errors on downstream analytics. METHODS We discuss the unique challenge of accounting for the specific patient's medical condition and personal characteristics such as age, weight, gender, and others, in identifying data errors when only a few measurements of each patient are available. To do so, we examine the prevalence and manifestations of errors in one of the most popular public medical research databases - MIMIC-III. We then evaluate how these errors impact visual analytics, score-based sepsis analytics SOFA and qSOFA, and a machine-learning-based sepsis predictor. RESULTS We find a variety of error patterns in MIMIC-III and highlight effective methods to find them. All analytics are found to be sensitive to sporadic error. Visual analytics are severely impacted, limiting their usefulness in the presence of error. qSOFA and SOFA suffer a score change of +1 (of 3) and +2.3-4 (of 15). The sepsis predictor suffers from a 0.01-0.3 score change compared to a median score of 0.08. CONCLUSIONS The use of statistical methods to detect data errors is limited to high-throughput scenarios and large data aggregations. There is a dearth of medical guidelines and error-detection practices to support rule-based systems, required to keep analytics safe and trustworthy in low-volume scenarios. Analytics developers should test their software’s sensitivity to error on public datasets. 
The medical informatics community should improve support for medical data-quality endeavors by creating guidelines for plausible values and analytics robustness to error and collecting real-world dirty datasets which contain errors as they appear in normal EMR use.
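
To make the qSOFA sensitivity concrete, the sketch below scores one patient record twice: once as recorded correctly and once with a single keying error in the respiratory rate. The qSOFA criteria used (respiratory rate of at least 22/min, systolic blood pressure of at most 100 mmHg, Glasgow Coma Scale below 15) are the standard ones; the `Vitals` class, the function names, and the plausibility limits are assumptions made for illustration and are neither taken from the paper nor clinical guidance.

```python
from dataclasses import dataclass

# Hypothetical plausibility limits, for illustration only (not clinical guidance).
PLAUSIBLE_RANGES = {
    "resp_rate": (4, 80),      # breaths per minute
    "systolic_bp": (30, 300),  # mmHg
    "gcs": (3, 15),            # Glasgow Coma Scale
}


@dataclass
class Vitals:
    resp_rate: float
    systolic_bp: float
    gcs: int


def qsofa(v: Vitals) -> int:
    """One point each for RR >= 22, SBP <= 100, and GCS < 15 (range 0-3)."""
    return int(v.resp_rate >= 22) + int(v.systolic_bp <= 100) + int(v.gcs < 15)


def implausible_fields(v: Vitals) -> list[str]:
    """Rule-based check: names of fields outside the assumed plausible ranges."""
    flagged = []
    for name, (low, high) in PLAUSIBLE_RANGES.items():
        if not low <= getattr(v, name) <= high:
            flagged.append(name)
    return flagged


clean = Vitals(resp_rate=18, systolic_bp=120, gcs=15)
typo = Vitals(resp_rate=180, systolic_bp=120, gcs=15)  # one extra digit keyed in

print(qsofa(clean), qsofa(typo))   # the single error raises the score by one
print(implausible_fields(typo))    # the range rule flags the erroneous field
```

A rule of this kind works on a single measurement, which is why it remains usable in the low-volume scenarios where statistical detection is not.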

Using Machine Learning for Dependable Outlier Detection in Environmental Monitoring Systems

ACM Transactions on Cyber-Physical Systems ◽

10.1145/3445812 ◽

2021 ◽

Vol 5 (3) ◽

pp. 1-30

Author(s):

Gonçalo Jesus ◽

António Casimiro ◽

Anabela Oliveira

Keyword(s):

Machine Learning ◽

Environmental Monitoring ◽

Data Quality ◽

Outlier Detection ◽

Prediction Models ◽

Sensor Data ◽

Natural Phenomenon ◽

Monitoring Systems ◽

Data Errors ◽

Redundant Data

Sensor platforms used in environmental monitoring applications are often subject to harsh environmental conditions while monitoring complex phenomena. Therefore, designing dependable monitoring systems is challenging given the external disturbances affecting sensor measurements. Even the apparently simple task of outlier detection in sensor data becomes a hard problem, amplified by the difficulty in distinguishing true data errors due to sensor faults from deviations due to natural phenomena, which look like data errors. Existing solutions for runtime outlier detection typically assume that the physical processes can be accurately modeled, or that outliers consist of large deviations that are easily detected and filtered by appropriate thresholds. Other solutions assume that it is possible to deploy multiple sensors providing redundant data to support voting-based techniques. In this article, we propose a new methodology for dependable runtime detection of outliers in environmental monitoring systems, aiming to increase data quality by treating the detected outliers. We propose the use of machine learning techniques to model each sensor's behavior, exploiting the existence of correlated data provided by other related sensors. Using these models, along with knowledge of processed past measurements, it is possible to obtain accurate estimations of the observed environment parameters and build failure detectors that use these estimations. When a failure is detected, these estimations also allow one to correct the erroneous measurements and hence improve the overall data quality. Our methodology not only allows one to distinguish truly abnormal measurements from deviations due to complex natural phenomena, but also allows the quantification of each measurement's quality, which is relevant from a dependability perspective. 
We apply the methodology to real datasets from a complex aquatic monitoring system, measuring temperature and salinity parameters, through which we illustrate the process for building the machine learning prediction models using a technique based on Artificial Neural Networks, denoted ANNODE (ANN Outlier Detection). From this application, we also observe the effectiveness of our ANNODE approach for accurate outlier detection in harsh environments. Then we validate these positive results by comparing ANNODE with state-of-the-art solutions for outlier detection. The results show that ANNODE improves on existing solutions in the accuracy of outlier detection.
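
The idea of estimating a sensor's reading from a correlated sensor and flagging large disagreements can be sketched as follows. This is not ANNODE: a least-squares line stands in for the neural network, and the function names, readings, and threshold are all assumptions chosen for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit y = slope * x + intercept (stand-in for the ANN model)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def detect_and_correct(neighbor, target, slope, intercept, threshold):
    """Flag target readings far from the model estimate and replace them with it.

    Returns a list of (value, is_outlier) pairs, one per reading.
    """
    results = []
    for x, y in zip(neighbor, target):
        estimate = slope * x + intercept
        is_outlier = abs(y - estimate) > threshold
        results.append((estimate if is_outlier else y, is_outlier))
    return results


# Hypothetical historical temperatures from two correlated sensors.
history_neighbor = [10.0, 12.0, 14.0, 16.0, 18.0]
history_target = [10.5, 12.5, 14.5, 16.5, 18.5]
slope, intercept = fit_line(history_neighbor, history_target)

# New readings: the second target value is a sensor fault.
new_neighbor = [11.0, 13.0, 15.0]
new_target = [11.4, 25.0, 15.6]
cleaned = detect_and_correct(new_neighbor, new_target, slope, intercept, threshold=1.0)
for value, flagged in cleaned:
    print(round(value, 2), flagged)
```

Because the estimate follows the neighboring sensor, a natural event that moves both sensors together leaves the residual small and is not flagged, whereas a fault in one sensor is.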

Some Evidence on the Detection of Data Errors

Advances in Information Resources Management - Advanced Topics in Information Resources Management, Volume 1 ◽

10.4018/978-1-930708-44-0.ch016 ◽

2002 ◽

pp. 279-295

Author(s):

Barbara D. Klein

Keyword(s):

Information Systems ◽

Error Detection ◽

Business Processes ◽

Municipal Bond ◽

Organizational Impact ◽

Organizational Settings ◽

Data Errors ◽

The Impact ◽

Published Research ◽

Correct Data

Data stored in organizational databases have a significant error rate. As computerized databases continue to proliferate, the number of errors in stored data and the organizational impact of these errors are likely to increase. The impact of data errors on business processes and decision making can be lessened if users of information systems are able and willing to detect and correct data errors. However, some published research suggests that users of information systems do not detect data errors. This paper reports the results of a study showing that municipal bond analysts detect data errors. The results provide insight into the conditions under which users in organizational settings detect data errors. Guidelines for improving error detection are also discussed.

FRAMEWORK FOR DATA QUALITY ASSURANCE BETWEEN COMPOSITE SERVICES

International Journal of Software Engineering and Knowledge Engineering ◽

10.1142/s0218194009004180 ◽

2009 ◽

Vol 19 (03) ◽

pp. 307-337 ◽

Cited By ~ 1

Author(s):

JUNG-WON LEE ◽

BYOUNGJU CHOI

Keyword(s):

Quality Assurance ◽

Data Quality ◽

Error Detection ◽

Service Oriented Architecture ◽

Data Transformation ◽

Data Errors ◽

Semantic Data ◽

Composite Services ◽

Data Constraints ◽

Learning Data

Today, businesses have to respond with flexibility and speed to ever-changing customer demand and market opportunities. Service-Oriented Architecture (SOA) is the best methodology for developing new services and integrating them with adaptability, the ability to respond to changing and new requirements. In this paper, we propose a framework for ensuring data quality between composite services, which solves semantic data transformation problems during service composition and, at the same time, detects data errors during service execution. We also minimize human intervention by learning data constraints as a basis for data transformation and error detection. We developed a data quality assurance service based on SOA, which makes it possible to improve the quality of services and to manage data effectively for a variety of SOA-based applications. As an empirical study, we applied the service to detect data errors between CRM and ERP services and showed that the data error rate could be reduced by more than 30%. We also showed that the automation rate for setting detection rules is over 41% when data constraints are learned from multiple registered services in the business domain.

MLGaze: Machine Learning-Based Analysis of Gaze Error Patterns in Consumer Eye Tracking Systems

Vision ◽

10.3390/vision4020025 ◽

2020 ◽

Vol 4 (2) ◽

pp. 25

Author(s):

Anuradha Kar

Keyword(s):

Machine Learning ◽

Eye Tracking ◽

Data Quality ◽

Regression Models ◽

Operating Conditions ◽

Eye Tracker ◽

Error Sources ◽

Error Patterns ◽

Machine Learning Methods ◽

The Impact

Analyzing the gaze accuracy characteristics of an eye tracker is a critical task, as its gaze data is frequently affected by non-ideal operating conditions in various consumer eye tracking applications. In previous research on pattern analysis of gaze data, efforts were made to model human visual behaviors and cognitive processes. What remains relatively unexplored are questions related to identifying gaze error sources as well as quantifying and modeling their impacts on the data quality of eye trackers. In this study, gaze error patterns produced by a commercial eye tracking device were studied with the help of machine learning algorithms, such as classifiers and regression models. Gaze data were collected from a group of participants under multiple conditions that commonly affect eye trackers operating on desktop and handheld platforms. These conditions (referred to here as error sources) include user distance, head pose, and eye-tracker pose variations, and the collected gaze data were used to train the classifier and regression models. It was seen that while the impact of the different error sources on gaze data characteristics was nearly impossible to distinguish by visual inspection or from data statistics, machine learning models were successful in identifying the impact of the different error sources and predicting the variability in gaze error levels due to these conditions. The objective of this study was to investigate the efficacy of machine learning methods for the detection and prediction of gaze error patterns, which would enable an in-depth understanding of the data quality and reliability of eye trackers under unconstrained operating conditions. Coding resources for all the machine learning methods adopted in this study were included in an open repository named MLGaze to allow researchers to replicate the principles presented here using data from their own eye trackers.

Data quality issues leading to sub optimal machine learning for money laundering models

Journal of Money Laundering Control ◽

10.1108/jmlc-05-2021-0049 ◽

2021 ◽

Vol ahead-of-print (ahead-of-print) ◽

Author(s):

Abhishek Gupta ◽

Dwijendra Nath Dwivedi ◽

Jigar Shah ◽

Ashish Jain

Keyword(s):

Machine Learning ◽

Data Quality ◽

Money Laundering ◽

Time Lag ◽

Time Duration ◽

Content Type ◽

Case Closure ◽

Definition Of ◽

The Impact

Purpose Good quality input data is critical to developing a robust machine learning model for identifying possible money laundering transactions. McKinsey, at one of the ACAMS conferences, cited data quality as one of the reasons artificial intelligence use cases in compliance struggle. Concerns were often raised about the data quality of predictors, such as wrong transaction codes, industry classification, etc. However, there has not been much discussion of the most critical variable in machine learning, the definition of an event, i.e. the date on which the suspicious activity report (SAR) is filed. Design/methodology/approach The team analyzed the transaction behavior of four major banks spread across Asia and Europe. Based on the findings, the team created a synthetic database comprising 2,000 SAR customers mimicking the time of investigation and case closure. In this paper, the authors focused on one very specific area of data quality, the definition of an event, i.e. the SAR/suspicious transaction report. Findings The analysis of a few of the banks in Asia and Europe suggests that this itself can improve the effectiveness of the model and reduce the prediction span, i.e. the time lag between the money laundering transaction and its prediction as an alert for investigation. Research limitations/implications The analysis was done with existing experience of all situations where the time duration between alert and case closure is high (anywhere from 15 days to 10 months). The team could not quantify the impact of this finding due to a lack of such actual cases observed so far. Originality/value The key finding of the paper suggests that money launderers typically either increase or reduce their level of activity in the recent quarter. This is not true in terms of real behavior. They typically show a spike in activity through various means during money laundering. 
This in turn impacts the quality of insights that the model should be trained on. The authors believe that once the financial institutions start speeding up investigations on high-risk cases, the scatter plot of SAR behavior will change significantly and will lead to better capture of money laundering behavior and a faster and more precise “catch” rate.

Comparative evaluation of contribution-value plots for machine learning understanding

Journal of Visualization ◽

10.1007/s12650-021-00776-w ◽

2021 ◽

Author(s):

Dennis Collaris ◽

Jarke J. van Wijk

Keyword(s):

Machine Learning ◽

Visual Analytics ◽

User Study ◽

Design Decisions ◽

Model Interpretation ◽

Novel Approach ◽

Definition Of ◽

The Impact ◽

The Relationship ◽

Strict Definition

Abstract The field of explainable artificial intelligence aims to help experts understand complex machine learning models. One key approach is to show the impact of a feature on the model prediction. This helps experts to verify and validate the predictions the model provides. However, many challenges remain open. For example, due to the subjective nature of interpretability, a strict definition of concepts such as the contribution of a feature remains elusive. Different techniques have varying underlying assumptions, which can cause inconsistent and conflicting views. In this work, we introduce local and global contribution-value plots as a novel approach to visualize feature impact on predictions and the relationship with feature value. We discuss design decisions and show an exemplary visual analytics implementation that provides new insights into the model. We conducted a user study and found the visualizations aid model interpretation by increasing correctness and confidence and reducing the time taken to obtain an insight.

Project based learning in Biomedical Data Science using the MIMIC III open dataset

Proceedings INNODOCT/20. International Conference on Innovation, Documentation and Education ◽

10.4995/inn2020.2020.11890 ◽

2020 ◽

Author(s):

Luis Alcalá ◽

Juan M García-Gómez ◽

Carlos Sáez

Keyword(s):

Machine Learning ◽

Life Cycle ◽

Data Quality ◽

Learning Outcomes ◽

Health Information ◽

Project Based Learning ◽

Learning Approach ◽

Biomedical Data ◽

The Subject ◽

Mimic Iii

The subjects Health Information Systems and Telemedicine and Data Quality and Interoperability of the Degree and Master in Biomedical Engineering of the Universitat Politècnica de València, Spain, address learning outcomes related to managing and processing biomedical databases, using health information standards for data capture and exchange, data quality assessment, and developing machine-learning models from these data. These learning outcomes cover a large range of distinct activities in the biomedical data life-cycle, which may hinder the learning process in the limited time assigned to the subject. We propose a project-based learning approach addressing the full life-cycle of biomedical data on the MIMIC-III (Medical Information Mart for Intensive Care III) Open Dataset, a freely accessible database comprising information relating to patients admitted to critical care units. By means of this active learning approach, students can achieve all the learning outcomes of the subject in an integrated manner: understanding the MIMIC-III data model, using health information standards such as International Classification of Diseases 9th Edition (ICD-9), mapping to interoperability standards, querying data, creating data tables, addressing data quality with a view to applying reliable statistical and machine learning analysis, and developing predictive models for several tasks such as predicting in-hospital mortality. MIMIC-III is widely used in academia and science, with a large amount of publicly available resources and scientific articles to support the students' learning. Additionally, the students will gain new competences in the use of Open Data and Research Ethics and Compliance Training.

Forecasting US movies box office performances in Turkey using machine learning algorithms

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-189120 ◽

2020 ◽

Vol 39 (5) ◽

pp. 6579-6590

Author(s):

Sandy Çağlıyor ◽

Başar Öztayşi ◽

Selime Sezgin

Keyword(s):

Machine Learning ◽

Global Economy ◽

Learning Algorithms ◽

Forecast Model ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

High Stakes ◽

Box Office ◽

Industry Forecast ◽

The Impact

The motion picture industry is one of the largest industries worldwide and has significant importance in the global economy. Considering the high stakes and high risks in the industry, forecast models and decision support systems are gaining importance. Several attempts have been made to estimate the theatrical performance of a movie before or at the early stages of its release. Nevertheless, these models are mostly used for predicting domestic performances and the industry still struggles to predict box office performances in overseas markets. In this study, the aim is to design a forecast model using different machine learning algorithms to estimate the theatrical success of US movies in Turkey. From various sources, a dataset of 1559 movies is constructed. First, independent variables are grouped as pre-release, distributor type, and international distribution based on their characteristics. The number of attendances is discretized into three classes. Four popular machine learning algorithms (artificial neural networks, decision tree regression, gradient boosting tree, and random forest) are employed, and the impact of each group is observed by comparing the performance of the models. Then the number of target classes is increased to five and eight, and the results are compared with the previously developed models in the literature.

The impact of economic plans on the Chinese education system: a machine learning approach

CADMO ◽

10.3280/cad2018-001005 ◽

2018 ◽

pp. 37-49

Author(s):

Wenjun Lin ◽

Xuefu Xu ◽

Francesco Dell’Anna

Keyword(s):

Machine Learning ◽

Education System ◽

Learning Approach ◽

Chinese Education ◽

System A ◽

Machine Learning Approach ◽

The Impact

(Preprint)

10.2196/preprints.12150 ◽

2018 ◽

Author(s):

Natalia Banasik ◽

Dariusz Jemielniak ◽

Wojciech Pędzich

Keyword(s):

Medical Condition ◽

Extended Period ◽

Well Being ◽

Marginal Effect ◽

Life Duration ◽

Significant Difference ◽

The Mean ◽

Academic Teachers ◽

Length Of Life ◽

The Impact

BACKGROUND Studies checking whether prayers actually extend the life duration of the people prayed for have produced mixed results. Most studies on the topic included a small number of prayers and most of them focused on people already struggling with a medical condition. Intercessory prayer’s influence on health is of scholarly interest, yet it is unclear if its effect may be dependent on the number of prayers for a named individual received per annum. OBJECTIVE We sought to examine if there is a noticeable increased longevity effect of intercessory prayer for a named individual’s well-being, if he receives a very high number of prayers per annum for an extended period. METHODS We retrieved and conducted a statistical analysis of the data about the length of life for 857 Roman Catholic bishops, 500 Catholic priests, and 3038 male academics from the US, France, Italy, Poland, Brazil, and Mexico. We obtained information for these individuals, who died between 1988 and 2018, from Wikidata, and conducted an observational cohort study. Bishops were chosen for the study, as they receive millions of individual prayers for well-being, according to conservative estimates. RESULTS There was a main effect for occupation, F(2, 4391) = 4.07, p = .017, ηp² = .002, with pairwise comparisons indicating significant differences between the mean life duration of bishops (M=30489) and of priests (M=29894), but none between the academic teachers (M=30147) and either of the other groups. A comparison analysis between bishops from the largest and the smallest dioceses showed no significant difference, t(67.31)=1.61, p = .11. Our main outcome measure is covariance of the mean length of life in each of the categories: bishops, priests, academic teachers, controlled for nationality. CONCLUSIONS The first analysis proved that bishops live longer than priests, but due to a marginal effect size this result should be treated with caution. 
No difference was found between the mean length of life of bishops from the largest and the smallest dioceses. We found no difference between bishops and male academics. These results show that the impact of intercessory prayers on longevity is not observable.
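
The marginal effect size mentioned above can be recovered from the reported F statistic and its degrees of freedom, using the standard identity relating partial eta squared to F. The worked check below uses only the values stated in the abstract; the subscript labels are notation chosen here.

```latex
\eta_p^2 = \frac{F \cdot df_{\text{effect}}}{F \cdot df_{\text{effect}} + df_{\text{error}}}
         = \frac{4.07 \times 2}{4.07 \times 2 + 4391}
         = \frac{8.14}{4399.14}
         \approx 0.0019
```

This rounds to the reported value of .002, meaning occupation accounts for roughly 0.2% of the variance in length of life not explained by other factors.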