Predicting Onset of Dementia Using Clinical Notes and Machine Learning: Case-Control Study (Preprint)

2020 ◽  
Author(s):  
Christopher A Hane ◽  
Vijay S Nori ◽  
William H Crown ◽  
Darshak M Sanghavi ◽  
Paul Bleicher

BACKGROUND Clinical trials need efficient tools to assist in recruiting patients at risk of Alzheimer disease and related dementias (ADRD). Early detection can also assist patients with financial planning for long-term care. Clinical notes are an important, underutilized source of information in machine learning models because of the cost of collection and complexity of analysis. OBJECTIVE This study aimed to investigate the use of deidentified clinical notes from multiple hospital systems collected over 10 years to augment retrospective machine learning models of the risk of developing ADRD. METHODS We used 2 years of data to predict the future outcome of ADRD onset. Clinical notes are provided in a deidentified format with specific terms and sentiments. Terms in clinical notes are embedded into a 100-dimensional vector space to identify clusters of related terms and abbreviations that differ across hospital systems and individual clinicians. RESULTS When using clinical notes, the area under the curve (AUC) improved from 0.85 to 0.94, and positive predictive value (PPV) increased from 45.07% (25,245/56,018) to 68.32% (14,153/20,717) in the model at disease onset. Models with clinical notes improved in both AUC and PPV in years 3-6, when the volume of notes was largest; results are mixed in years 7 and 8, which have the smallest cohorts. CONCLUSIONS Although clinical notes helped in the short term, the presence of ADRD symptomatic terms years earlier than onset adds evidence to other studies that clinicians undercode diagnoses of ADRD. Deidentified clinical notes increase the accuracy of risk models. Clinical notes collected across multiple hospital systems via natural language processing can be merged using postprocessing techniques to aid model accuracy.
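The PPV percentages quoted in the results can be reproduced from the reported counts, on the usual definition of PPV as true positives divided by all predicted positives. The following is a minimal Python check; the function name `ppv` and the variable names are illustrative choices, not taken from the study.

```python
def ppv(true_positives: int, predicted_positives: int) -> float:
    """Positive predictive value: share of flagged patients who are true cases."""
    if predicted_positives <= 0:
        raise ValueError("predicted_positives must be positive")
    return true_positives / predicted_positives


# Counts as reported in the abstract for the model at disease onset.
without_notes = ppv(25_245, 56_018)
with_notes = ppv(14_153, 20_717)

print(f"PPV without notes: {without_notes:.2%}")
print(f"PPV with notes:    {with_notes:.2%}")
```

Read this way, the denominator shrinks from 56,018 to 20,717, so the model with notes appears to flag fewer patients while a larger share of those flagged are true cases.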

10.2196/17819 ◽  
2020 ◽  
Vol 8 (6) ◽  
pp. e17819 ◽  


2021 ◽  
Author(s):  
Abul Hasan ◽  
Mark Levene ◽  
David Weston ◽  
Renate Fromson ◽  
Nicolas Koslover ◽  
...  

BACKGROUND The COVID-19 pandemic has created a pressing need for integrating information from disparate sources in order to assist decision makers. Social media is important in this respect; however, to make sense of the textual information it provides and to automate the processing of large amounts of data, natural language processing methods are needed. Social media posts are often noisy, yet they may provide valuable insights regarding the severity and prevalence of the disease in the population. In particular, machine learning techniques for triage and diagnosis could allow for a better understanding of what social media may offer in this respect. OBJECTIVE This study aims to develop an end-to-end natural language processing pipeline for triage and diagnosis of COVID-19 from patient-authored social media posts, in order to provide researchers and other interested parties with additional information on the symptoms, severity and prevalence of the disease. METHODS The text processing pipeline first extracts COVID-19 symptoms and related concepts such as severity, duration, negations, and body parts from patients’ posts using conditional random fields. An unsupervised rule-based algorithm is then applied to establish relations between concepts in the next step of the pipeline. The extracted concepts and relations are subsequently used to construct two different vector representations of each post. These vectors are applied separately to build support vector machine learning models to triage patients into three categories and diagnose them for COVID-19. RESULTS We report macro- and micro-averaged F1 scores in the range of 71-96% and 61-87%, respectively, for the triage and diagnosis of COVID-19 when the models are trained on human-labelled data. Our experimental results indicate that similar performance can be achieved when the models are trained using predicted labels from concept extraction and rule-based classifiers, thus yielding end-to-end machine learning. Also, we highlight important features uncovered by our diagnostic machine learning models and compare them with the most frequent symptoms revealed in another COVID-19 dataset. In particular, we found that the most important features are not always the most frequent ones. CONCLUSIONS Our preliminary results show that it is possible to automatically triage and diagnose patients for COVID-19 from natural language narratives using a machine learning pipeline, in order to provide additional information on the severity and prevalence of the disease through the eyes of social media.
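Because the results are stated as macro- and micro-averaged F1 scores, a small sketch of how the two averages differ may help. It is written in plain Python under stated assumptions: the three class names `A`, `B`, `C` and the example labels are hypothetical, since the abstract does not name the triage categories, and the function `f1_scores` is not the authors' evaluation code.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def f1(tp_count, fp_count, fn_count):
        denom = 2 * tp_count + fp_count + fn_count
        return 2 * tp_count / denom if denom else 0.0

    # Macro: average the per-class F1 scores, so every class weighs the same.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool the counts over all classes, so every post weighs the same.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Hypothetical labels for seven posts across three placeholder categories.
y_true = ["A", "A", "A", "A", "B", "B", "C"]
y_pred = ["A", "A", "A", "B", "B", "C", "C"]
macro_f1, micro_f1 = f1_scores(y_true, y_pred)
print(f"macro F1 = {macro_f1:.3f}, micro F1 = {micro_f1:.3f}")
```

Macro averaging gives rare classes the same weight as common ones, which is one reason the two ranges reported in an abstract like this can differ.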


Author(s):  
Son

Extracting keywords from documents is an essential task in natural language processing. A challenge of this task is to define a reasonable set of keywords from which we can find all relevant documents. This paper proposes a new approach that exploits word-level handcrafted features and machine learning models to select a single document's most important keywords. To evaluate the proposed solution, we compare our results with the latest supervised and unsupervised automatic keyword extraction methods. Experimental results show that our model achieves the best results on 9 of the 20 data corpora, which suggests that the proposed approach is promising.


2020 ◽  
Vol 19 (1) ◽  
pp. 37-50
Author(s):  
M. Makino ◽  
T. Odaka ◽  
J. Kuroiwa ◽  
I. Suwa ◽  
H. Shirai

Abstract In tennis, the accumulation of data has progressed and research on tactical analysis has been conducted. Estimating strategically important factors would have the benefit of providing players with useful advice and helping audience members understand what tennis players are good at. Previous research has investigated ways of predicting Association of Tennis Professionals (ATP) tennis match outcomes as well as estimating factors that are important for victories using machine learning models. A limitation of previous research is that the identified victory factors lack concreteness. We attributed this problem to the fact that previous researchers used game summaries as features and did not consider the process of rallies within points, so this research focused on calculating the frequency of single shots, two-shot patterns, and specific effective shot patterns from each point rally of ATP singles matches. We then used those data to predict point winners and identify useful features using L1-regularized logistic regression. The highest accuracy obtained was 66.5%, and the area under the curve (AUC) was 0.689. The most prominent feature we found was the ratio of specific shots by specific players. These results suggest that our method can reveal tactical factors more concretely than previous studies.
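The feature construction described above, counting single shots and two-shot patterns within a point rally, can be sketched as follows. This is an illustrative Python sketch only: the shot labels, the key format, and the function name `shot_pattern_features` are assumptions, and it ignores which player hit each shot, which the study does take into account.

```python
from collections import Counter


def shot_pattern_features(rally):
    """Count single shots and consecutive two-shot patterns in one rally."""
    features = Counter()
    for shot in rally:
        features[f"1:{shot}"] += 1
    # Pair each shot with the one that follows it to form two-shot patterns.
    for first, second in zip(rally, rally[1:]):
        features[f"2:{first}>{second}"] += 1
    return features


# Hypothetical rally; real data would use the shot taxonomy of the dataset.
rally = ["serve", "forehand", "backhand", "forehand", "backhand"]
features = shot_pattern_features(rally)
for name, count in sorted(features.items()):
    print(name, count)
```

Count vectors of this kind, one per point, are the sort of input an L1-regularized logistic regression could take; the L1 penalty drives many coefficients to exactly zero, which is what makes the surviving patterns readable as candidate tactical factors.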


2020 ◽  
Vol 17 (8) ◽  
pp. 3776-3781
Author(s):  
M. Adimoolam ◽  
Raghav Sharma ◽  
A. John ◽  
M. Suresh Kumar ◽  
K. Ashok Kumar

In the past few decades, people have experienced a tremendous intensification of online interaction, particularly on microblogging websites and other social media. Many kinds of data are in use, and classifying data for grouping and storage is challenging in real-world scenarios. Various machine learning and Natural Language Processing (NLP) techniques have been applied to analyse sentiment. This work concentrated on using several machine learning algorithms to perform sentiment analysis and on comparing machine learning models for sentiment classification. It analysed various sentiments using multiple classification approaches. From the evaluation of this experiment, it can be concluded that NLP and machine learning techniques are efficient for sentiment analysis.


2020 ◽  
Vol 27 (3) ◽  
pp. 437-443 ◽  
Author(s):  
Zina M Ibrahim ◽  
Honghan Wu ◽  
Ahmed Hamoud ◽  
Lukas Stappen ◽  
Richard J B Dobson ◽  
...  

Abstract Objectives Current machine learning models aiming to predict sepsis from electronic health records (EHR) do not account for the heterogeneity of the condition despite its emerging importance in prognosis and treatment. This work demonstrates the added value of stratifying the types of organ dysfunction observed in patients who develop sepsis in the intensive care unit (ICU) in improving the ability to recognize patients at risk of sepsis from their EHR data. Materials and Methods Using an ICU dataset of 13 728 records, we identify clinically significant sepsis subpopulations with distinct organ dysfunction patterns. We perform classification experiments with random forest, gradient boosted trees, and support vector machines, using the identified subpopulations to distinguish patients who develop sepsis in the ICU from those who do not. Results The classification results show that features selected using sepsis subpopulations as background knowledge yield a superior performance in distinguishing septic from non-septic patients regardless of the classification model used. The improved performance is especially pronounced in specificity, which is a current bottleneck in sepsis prediction machine learning models. Conclusion Our findings can steer machine learning efforts toward more personalized models for complex conditions including sepsis.


Author(s):  
Jiancheng Ye ◽  
Liang Yao ◽  
Jiahong Shen ◽  
Rethavathi Janarthanam ◽  
Yuan Luo

Abstract Background Diabetes mellitus is a prevalent metabolic disease characterized by chronic hyperglycemia. The avalanche of healthcare data is accelerating precision and personalized medicine. Artificial intelligence and algorithm-based approaches are becoming more and more vital to support clinical decision-making. These methods are able to augment health care providers by taking away some of their routine work and enabling them to focus on critical issues. However, few studies have used predictive modeling to uncover associations between comorbidities in ICU patients and diabetes. This study aimed to use Unified Medical Language System (UMLS) resources, involving machine learning and natural language processing (NLP) approaches to predict the risk of mortality. Methods We conducted a secondary analysis of Medical Information Mart for Intensive Care III (MIMIC-III) data. Different machine learning modeling and NLP approaches were applied. Domain knowledge in health care is built on the dictionaries created by experts who defined the clinical terminologies such as medications or clinical symptoms. This knowledge is valuable to identify information from text notes that assert a certain disease. Knowledge-guided models can automatically extract knowledge from clinical notes or biomedical literature that contains conceptual entities and relationships among these various concepts. Mortality classification was based on the combination of knowledge-guided features and rules. UMLS entity embedding and convolutional neural network (CNN) with word embeddings were applied. Concept Unique Identifiers (CUIs) with entity embeddings were utilized to build clinical text representations. Results The best configuration of the employed machine learning models yielded a competitive AUC of 0.97. Machine learning models along with NLP of clinical notes are promising to assist health care providers to predict the risk of mortality of critically ill patients. 
Conclusion UMLS resources and clinical notes are powerful and important tools to predict mortality in diabetic patients in the critical care setting. The knowledge-guided CNN model is effective (AUC = 0.97) for learning hidden features.


2021 ◽  
Vol 2021 ◽  
pp. 1-10
Author(s):  
Fengyi Zhang ◽  
Xinyuan Cui ◽  
Renrong Gong ◽  
Chuan Zhang ◽  
Zhigao Liao

This study aimed to provide effective methods for identifying surgeries with high cancellation risk based on machine learning models and to analyze the key factors that affect identification performance. The data covered elective urologic surgeries at West China Hospital in China from January 1, 2013, to December 31, 2014. All surgeries were scheduled one day in advance, and all cancellations were of institutional resource- and capacity-related types. Feature selection strategies, machine learning models, and sampling methods are the most discussed topics in general machine learning research and have a direct impact on the performance of machine learning models. Hence, they were combined systematically to generate complete schemes for machine learning-based identification of surgery cancellations. The results showed that identifying surgeries with high cancellation risk is feasible and robust, with a maximum area under the curve (AUC) of 0.7199 for the random forest model with original sampling and a backward selection strategy. In addition, a one-sided DeLong test and sum of squared error analysis were conducted to measure the effects of feature selection strategy, machine learning model, and sampling method on the identification of surgeries with high cancellation risk, and the choice of machine learning model was identified as the key factor. This study offers methodology and insights for identifying the key experimental factors in surgery cancellation identification, and it should be helpful for further research on machine learning-based identification of surgeries with high cancellation risk.
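Since several of the abstracts on this page report AUC, a short sketch of what the number measures may be useful. The AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The Python below computes it by direct pairwise comparison; the function name `auc` and the toy labels and scores are illustrative, not data from any of the studies.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    # Each positive/negative pair scores 1 if ranked correctly, 0.5 if tied.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical example: 1 = cancelled surgery, 0 = surgery performed.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.1]
print(f"AUC = {auc(labels, scores):.4f}")
```

The pairwise form is quadratic in the number of cases, so it suits small illustrations; rank-based formulations give the same value more efficiently on large datasets.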

