What health records data are required for accurate prediction of suicidal behavior?

2019 ◽  
Vol 26 (12) ◽  
pp. 1458-1465 ◽  
Author(s):  
Gregory E Simon ◽  
Susan M Shortreed ◽  
Eric Johnson ◽  
Rebecca C Rossom ◽  
Frances L Lynch ◽  
...  

Abstract Objective The study sought to evaluate how the availability of different types of health records data affects the accuracy of machine learning models predicting suicidal behavior. Materials and Methods Records from 7 large health systems identified 19 061 056 outpatient visits to mental health specialty or general medical providers between 2009 and 2015. Machine learning models (logistic regression with penalized LASSO [least absolute shrinkage and selection operator] variable selection) were developed to predict suicide death (n = 1240) or probable suicide attempt (n = 24 133) in the following 90 days. Base models used only historical insurance claims data and were then augmented with data regarding sociodemographic characteristics (race, ethnicity, and neighborhood characteristics), past patient-reported outcome questionnaires from electronic health records, and data (diagnoses and questionnaires) recorded during the visit. Results For prediction of any attempt following mental health specialty visits, a model limited to historical insurance claims data performed approximately as well (C-statistic 0.843) as a model using all available data (C-statistic 0.850). For prediction of suicide attempt following a general medical visit, addition of data recorded during the visit yielded a meaningful improvement over a model using all data up to the prior day (C-statistic 0.853 vs 0.838). Discussion Results may not generalize to settings with less comprehensive data or different patterns of care. Even the poorest-performing models were superior to brief self-report questionnaires or traditional clinical assessment. Conclusions Implementation of suicide risk prediction models in mental health specialty settings may be less technically demanding than expected. In general medical settings, however, delivery of optimal risk predictions at the point of care may require more sophisticated informatics capability.
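To make the modeling choice concrete, below is a minimal sketch of LASSO-penalized logistic regression in scikit-learn, assuming a prepared visit-level feature matrix; the data, feature counts, and labels are placeholders, not the study's actual variables.

```python
# Minimal sketch: logistic regression with an L1 (LASSO) penalty for
# variable selection, as described in the abstract above.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# X: visit-level predictors (e.g., claims-history indicators); y: suicide
# attempt within 90 days (1) or not (0). Random placeholder data here.
X, y = np.random.rand(1000, 50), np.random.randint(0, 2, 1000)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# The L1 penalty drives uninformative coefficients to zero, mimicking LASSO selection.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
model.fit(X_train, y_train)

# The C-statistic reported in the abstract is the AUROC.
print("C-statistic:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```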

2019 ◽  
Author(s):  
Otto Von Sperling ◽  
Marcelo Ladeira

The literature on computerized models that help detect, study, and understand signs of mental health disorders from social media has been thriving since the mid-2000s for English speakers. In Brazil, this area of research shows promising results, in addition to a variety of niches that still need exploring. Thus, we construct a large corpus from 2941 users (1486 depressive, 1455 non-depressive) and induce machine learning models to identify signs of depression from our Twitter corpus. To achieve this, we extract features by measuring linguistic style, behavioral patterns, and affect from users' public tweets and metadata. The resulting models successfully distinguish between depressive and non-depressive classes with performance scores comparable to results in the literature. We hope that our findings can become stepping stones toward more methodologies being applied in the service of mental health.
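As a rough illustration of this classification setup, the following sketch trains a binary depressive/non-depressive text classifier; TF-IDF features stand in for the paper's richer linguistic-style, behavioral, and affect features, and the example tweets and labels are toy placeholders.

```python
# Illustrative sketch only: text features feed a binary
# depressive/non-depressive classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tweets = ["i can't sleep and nothing feels worth doing",
          "great run this morning with friends"]
labels = [1, 0]  # 1 = depressive, 0 = non-depressive (toy data)

# TF-IDF is a stand-in for the linguistic-style/behavior/affect features.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(tweets, labels)
print(clf.predict(["feeling hopeless again today"]))
```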


JAMIA Open ◽  
2021 ◽  
Vol 4 (3) ◽  
Author(s):  
Suparno Datta ◽  
Jan Philipp Sachs ◽  
Harry Freitas da Cruz ◽  
Tom Martensen ◽  
Philipp Bode ◽  
...  

Abstract Objectives The development of clinical predictive models hinges upon the availability of comprehensive clinical data. Tapping into such resources requires considerable effort from clinicians, data scientists, and engineers. Specifically, these efforts are focused on data extraction and preprocessing steps required prior to modeling, including complex database queries. A handful of software libraries exist that can reduce this complexity by building upon data standards. However, a gap remains concerning electronic health records (EHRs) stored in star schema clinical data warehouses, an approach often adopted in practice. In this article, we introduce the FlexIBle EHR Retrieval (FIBER) tool: a Python library built on top of a star schema (i2b2) clinical data warehouse that enables flexible generation of modeling-ready cohorts as data frames. Materials and Methods FIBER was developed on top of a large-scale star schema EHR database which contains data from 8 million patients and over 120 million encounters. To illustrate FIBER’s capabilities, we present its application by building a heart surgery patient cohort with subsequent prediction of acute kidney injury (AKI) with various machine learning models. Results Using FIBER, we were able to build the heart surgery cohort (n = 12 061), identify the patients that developed AKI (n = 1005), and automatically extract relevant features (n = 774). Finally, we trained machine learning models that achieved area under the curve values of up to 0.77 for this exemplary use case. Conclusion FIBER is an open-source Python library developed for extracting information from star schema clinical data warehouses and reduces time-to-modeling, helping to streamline the clinical modeling process.
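To show the downstream step in concrete terms, here is a hedged sketch of the modeling stage only, assuming a modeling-ready data frame has already been extracted (by FIBER or any other tool); the cohort, column names, and label are hypothetical toy stand-ins, and FIBER's own API is deliberately not depicted here.

```python
# Sketch of the post-extraction workflow: a data frame of cohort features
# feeds a standard scikit-learn model predicting AKI.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# One row per heart-surgery patient; `aki` is the binary outcome label.
# Toy stand-in for the n = 12 061 cohort described above.
cohort = pd.DataFrame({
    "age": [64, 71, 58, 69],
    "creatinine_baseline": [1.0, 1.4, 0.9, 1.2],
    "bypass_minutes": [95, 140, 80, 120],
    "aki": [0, 1, 0, 1],
})
X, y = cohort.drop(columns="aki"), cohort["aki"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, stratify=y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
```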


2020 ◽  
Author(s):  
Maqsood Ahmad ◽  
Noorhaniza Wahid ◽  
Rahayu A Hamid ◽  
Saima Sadiq ◽  
Arif Mehmood ◽  
...  

BACKGROUND Mental health signifies the emotional, social, and psychological well-being of a person. It also shapes how a person thinks, feels, and handles situations. Stable mental health supports working at full potential through all stages of life, from childhood to adulthood, so detecting the onset of mental disease early is of significant importance for maintaining balance in life. Mental health problems are rising globally and constitute a growing burden on health-care systems. Early diagnosis helps professionals treat conditions that may lead to complications if left untreated. Machine learning models are widely used for medical data analysis, disease diagnosis, and psychiatric nosology. OBJECTIVE This research addresses the challenge of detecting six major psychological disorders, namely Anxiety, Bipolar Disorder, Conversion Disorder, Depression, Mental Retardation, and Schizophrenia, by applying decision-level fusion of supervised machine learning algorithms. METHODS Observations recorded by medical experts were used for training and testing the models. Furthermore, to reduce the impact of conflicting decisions, a voting scheme, the Shrewd Probing Prediction Model (SPPM), is introduced to obtain output from an ensemble of Random Forest and Gradient Boosting Machine models (RF+GBM). RESULTS The proposed model achieved Term Frequency-Inverse Document Frequency (TF-IDF)-based average accuracy, precision, recall, and F1 score of 67%, outperforming other machine learning models, namely Random Forest (RF), Gradient Boosting Machine (GBM), Logistic Regression (LR), and Support Vector Machines (SVM). CONCLUSIONS This research provides an intuitive solution for mental disorder analysis across different target class labels or groups. A framework is proposed for determining the mental health problem of patients using observations of medical experts. The framework consists of an ensemble model based on RF and GBM with a novel voting technique, SPPM (RF+GBM). This decision-level fusion approach achieves Accuracy, Recall, and F1 score of 67%, 66%, and 67%, respectively. The framework appears suitable for large and diverse multi-class datasets. Furthermore, three TF-IDF vector spaces (unigram, bigram, and trigram) were also tested on the baseline machine learning models and the proposed model; experiments revealed that unigrams performed best on the experimental dataset. In the future, physiological parameters such as respiratory rate, ECG, and EEG signals could be included as features to improve accuracy. The proposed framework could also be tested on a wider range of mental illness categories by adding more diseases to the dataset, which would increase the number of class labels.
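The fusion idea can be sketched with a plain soft-voting RF+GBM ensemble over TF-IDF unigrams, as below; this is only a stand-in for the paper's SPPM voting scheme, whose exact rules are not reproduced here, and the text observations and labels are toy placeholders.

```python
# Sketch: soft-voting RF+GBM ensemble over a TF-IDF unigram space,
# approximating the decision-level fusion described above (not SPPM itself).
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

notes = ["patient reports persistent sadness",
         "episodes of elevated mood and risk taking"]
labels = ["Depression", "Bipolar Disorder"]  # two of the six classes, toy data

ensemble = VotingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("gbm", GradientBoostingClassifier(random_state=0))],
    voting="soft",  # averages class probabilities before the final decision
)
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 1)), ensemble)  # unigram space
clf.fit(notes, labels)
print(clf.predict(["feels hopeless and withdrawn"]))
```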


Stroke ◽  
2017 ◽  
Vol 48 (suppl_1) ◽  
Author(s):  
Charles Esenwa ◽  
Jorge Luna ◽  
Benjamin Kummer ◽  
Hojjat Salmasian ◽  
Hooman Kamel ◽  
...  

Introduction: Stroke research using widely available institutional, state-wide, and national retrospective data depends on accurate identification of stroke subtypes in claims data. Despite the abundance of such data and advances in clinical informatics, there are limited published data on applying machine learning models to improve previously reported administrative stroke identification algorithms. Hypothesis: We hypothesized that machine learning models can be applied to claims data coded using the International Classification of Disease, version 9 (ICD-9), to accurately identify patients with ischemic stroke (IS), intracerebral hemorrhage (ICH), and subarachnoid hemorrhage (SAH), and that these models would outperform previously published algorithms in our patient cohort. Methods: We developed a gold standard list of 427 stroke patients continuously admitted to our institution from 1/1/2015 to 9/30/2015 using an internal stroke database and applied 75% of it to train and 25% to test two machine learning models: one using classification and regression tree (CART) and another using regularized logistic regression. There were 2,241 negative controls. We further applied a previously reported stroke detection algorithm, by Tirschwell and Longstreth, to our cohort for comparison. Results: The CART model had a κ of 0.72, 0.82, 0.59; sensitivity of 95%, 99%, 99%; and a specificity of 88%, 78%, 75% for IS, ICH, and SAH, respectively. The regularized logistic regression model had a κ of 0.73, 0.80, 0.59; sensitivity of 95%, 99%, 99%; and a specificity of 89%, 78%, 75% for IS, ICH, and SAH, respectively. The previously reported algorithm by Tirschwell et al had a κ of 0.71, 0.56, 0.64; sensitivity of 98%, 99%, 99%; and a specificity of 64%, 52%, 50% for IS, ICH, and SAH. Conclusion: Compared with the previously reported ICD-9-based detection algorithm, the machine learning models had a higher κ for diagnosis of IS and ICH, similar sensitivity for all subtypes, and higher specificity for all stroke subtypes in our cohort. Applying machine learning models to identify stroke subtypes from administrative data sets can yield highly accurate stroke subtype identification for health services researchers.
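A minimal sketch of the two compared models follows, assuming a binary indicator matrix of ICD-9 codes per admission; the codes, labels, and 75/25 split mirror the setup above, but all data are synthetic placeholders.

```python
# Sketch: CART vs. regularized logistic regression on ICD-9 code indicators.
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # scikit-learn's CART implementation
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(400, 30))  # presence/absence of 30 ICD-9 codes (toy)
y = rng.integers(0, 2, size=400)        # 1 = ischemic stroke, 0 = control (toy)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)  # 75/25 split

for name, model in [("CART", DecisionTreeClassifier(random_state=0)),
                    ("regularized LR", LogisticRegression(penalty="l2", C=1.0))]:
    model.fit(X_tr, y_tr)
    print(name, "kappa:", cohen_kappa_score(y_te, model.predict(X_te)))
```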


Stroke ◽  
2017 ◽  
Vol 48 (suppl_1) ◽  
Author(s):  
Charles Esenwa ◽  
Jorge Luna ◽  
Benjamin Kummer ◽  
Hojjat Salmasian ◽  
David Vawdrey ◽  
...  

Introduction: Retrospective identification of patients hospitalized with a new diagnosis of acute ischemic stroke is important for administrative quality assurance, post-discharge clinical management, and stroke research. The benefit of using administrative claims data is its widespread availability; the disadvantage is the inability to accurately and consistently identify the clinical diagnosis of interest. Hypothesis: We hypothesized that decision tree and logistic regression models could be applied to administrative claims data coded using the International Classification of Diseases, version 10 (ICD-10) to create algorithms that accurately identify patients with acute ischemic stroke. Methods: We used hospital records from our institution to develop a gold standard list of 243 patients continuously hospitalized with a new diagnosis of stroke from 10/1/2015 to 3/31/2016. We used 1,393 neurological patients without a diagnosis of stroke as negative controls. This list was used to train and test two machine learning methods that analyze diagnosis and procedure codes to identify ischemic stroke: one using classification and regression tree (CART) and another using regularized logistic regression. We trained the models using 75% of the data and evaluated them on the remaining 25%. Results: The CART model had a κ = 0.78, sensitivity of 96%, specificity of 90%, and a positive predictive value of 99%. The regularized logistic regression model had a κ = 0.73, sensitivity of 97%, specificity of 81%, and a positive predictive value of 98%. Conclusion: Both the decision tree and logistic regression machine learning models showed very high accuracy in identifying patients with a new diagnosis of ischemic stroke from ICD-10 claims data when compared with our gold standard. Applying these machine learning models to identify patients with ischemic stroke has widespread applications, especially now that national billing data have transitioned from ICD-9 to ICD-10 codes.
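The validation metrics reported in these two stroke abstracts all follow from a 2x2 confusion matrix of predicted vs. gold-standard labels; the short sketch below shows the computation on toy label vectors.

```python
# Sketch: deriving kappa, sensitivity, specificity, and PPV from a
# confusion matrix of predicted vs. gold-standard stroke labels (toy data).
from sklearn.metrics import confusion_matrix, cohen_kappa_score

y_true = [1, 1, 1, 0, 0, 0, 0, 1]  # gold standard: 1 = ischemic stroke
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]  # model output

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("sensitivity:", tp / (tp + fn))  # recall among true strokes
print("specificity:", tn / (tn + fp))  # recall among controls
print("PPV:", tp / (tp + fp))          # positive predictive value
print("kappa:", cohen_kappa_score(y_true, y_pred))
```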


2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Jiaxin Fan ◽  
Mengying Chen ◽  
Jian Luo ◽  
Shusen Yang ◽  
Jinming Shi ◽  
...  

Abstract Background Screening carotid B-mode ultrasonography is a frequently used method to detect subjects with carotid atherosclerosis (CAS). Because most CAS progresses asymptomatically, early identification is challenging for clinicians, and undetected CAS may trigger ischemic stroke. Recently, machine learning has shown a strong ability to classify data and a potential for prediction in the medical field. The combined use of machine learning and patients' electronic health records could provide clinicians with a more convenient and precise method to identify asymptomatic CAS. Methods This was a retrospective cohort study using routine clinical data of medical check-up subjects from April 19, 2010 to November 15, 2019. Six machine learning models (logistic regression [LR], random forest [RF], decision tree [DT], eXtreme Gradient Boosting [XGB], Gaussian Naïve Bayes [GNB], and K-Nearest Neighbour [KNN]) were used to predict asymptomatic CAS, and their predictive performance was compared in terms of the area under the receiver operating characteristic curve (AUCROC), accuracy (ACC), and F1 score (F1). Results Of the 18,441 subjects, 6553 were diagnosed with asymptomatic CAS. Compared to DT (AUCROC 0.628, ACC 65.4%, and F1 52.5%), the other five models improved prediction: KNN + 7.6% (0.704, 68.8%, and 50.9%, respectively), GNB + 12.5% (0.753, 67.0%, and 46.8%, respectively), XGB + 16.0% (0.788, 73.4%, and 55.7%, respectively), RF + 16.6% (0.794, 74.5%, and 56.8%, respectively), and LR + 18.1% (0.809, 74.7%, and 59.9%, respectively). The best-performing model, LR, predicted 1045/1966 cases (sensitivity 53.2%) and 3088/3566 non-cases (specificity 86.6%). A tenfold cross-validation scheme further verified the predictive ability of the LR model. Conclusions Among the machine learning models, LR showed optimal performance in predicting asymptomatic CAS. Our findings set the stage for an early automatic alarming system, allowing a more precise allocation of CAS prevention measures to the individuals likely to benefit most.
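A hedged sketch of this kind of multi-model comparison follows, using tenfold cross-validated AUROC on synthetic tabular data; XGB is omitted to keep the sketch scikit-learn-only, and the features and labels are placeholders, not the study's check-up variables.

```python
# Sketch: comparing several classifiers by tenfold cross-validated AUROC.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 12))    # stand-in for check-up features (age, lipids, ...)
y = rng.integers(0, 2, size=500)  # 1 = asymptomatic CAS (toy labels)

models = {"LR": LogisticRegression(),
          "RF": RandomForestClassifier(random_state=0),
          "DT": DecisionTreeClassifier(random_state=0),
          "GNB": GaussianNB(),
          "KNN": KNeighborsClassifier()}
for name, model in models.items():  # tenfold cross-validated AUROC per model
    auc = cross_val_score(model, X, y, cv=10, scoring="roc_auc").mean()
    print(f"{name}: AUROC = {auc:.3f}")
```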


2020 ◽  
Author(s):  
Janmajay Singh ◽  
Masahiro Sato ◽  
Tomoko Ohkuma

BACKGROUND Missing data in electronic health records are inevitable and considered to be nonrandom. Several studies have found that features indicating missing patterns (missingness) encode useful information about a patient's health and advocate for their inclusion in clinical prediction models, but their effectiveness has not been comprehensively evaluated. OBJECTIVE The goal of the research is to study the effect of including informative missingness features in machine learning models for various clinically relevant outcomes and to explore the robustness of these features across patient subgroups and task settings. METHODS A total of 48,336 electronic health records from the 2012 and 2019 PhysioNet Challenges were used, and mortality, length of stay, and sepsis outcomes were chosen. The latter dataset was multicenter, allowing external validation. Gated recurrent units were used to learn sequential patterns in the data and classify or predict labels of interest. Models were evaluated on various criteria, including discriminative ability and calibration, and across population subgroups. RESULTS Including missingness features generally improved model performance in retrospective tasks. The extent of improvement depended on the outcome of interest (improvements in the area under the receiver operating characteristic curve [AUROC] ranged from 1.2% to 7.7%) and even on the patient subgroup. However, missingness features did not display utility in a simulated prospective setting, being outperformed (0.9% difference in AUROC) by the model relying only on pathological features. This occurred despite earlier detection of disease (true positives), because including these features also led to a concomitant rise in false positive detections. CONCLUSIONS This study comprehensively evaluated the effectiveness of missingness features in machine learning models. A detailed understanding of how these features affect model performance may lead to their informed use in clinical settings, especially for administrative tasks like length-of-stay prediction, where they present the greatest benefit. While missingness features, representative of health care processes, vary greatly due to intra- and interhospital factors, they may still be used in prediction models for clinically relevant outcomes. However, their use in prospective models producing frequent predictions needs to be explored further.
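As a minimal illustration of what "missingness features" are, the sketch below builds binary indicators marking which measurements were never recorded and appends them alongside imputed values; the column names and values are hypothetical, and the paper's sequential GRU modeling is not reproduced here.

```python
# Sketch: constructing missingness-indicator features from raw EHR channels.
import numpy as np
import pandas as pd

vitals = pd.DataFrame({"heart_rate": [80, np.nan, 95],
                       "lactate":    [np.nan, 2.1, np.nan]})

# 1 where the value was never recorded; these flags encode the care process
# (e.g., which tests a clinician chose to order), not physiology itself.
indicators = vitals.isna().astype(int).add_suffix("_missing")

# Simple mean imputation for the observed channels, then concatenate.
features = pd.concat([vitals.fillna(vitals.mean()), indicators], axis=1)
print(features)
```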

