Dementia risks identified by vocal features via telephone conversations: A novel machine learning prediction model

PLoS ONE ◽  
2021 ◽  
Vol 16 (7) ◽  
pp. e0253988
Author(s):  
Akihiro Shimoda ◽  
Yue Li ◽  
Hana Hayashi ◽  
Naoki Kondo

Due to the difficulty of early diagnosis of Alzheimer’s disease (AD), related to cost and differential diagnostic capability, it is necessary to identify low-cost, accessible, and reliable tools for identifying AD risk in the preclinical stage. We hypothesized that cognitive ability, as expressed in the vocal features of daily conversation, is associated with AD progression. We therefore developed a novel machine learning prediction model that identifies AD risk from the rich voice data collected in daily conversations, and evaluated its predictive performance against a classification method based on the Japanese version of the Telephone Interview for Cognitive Status (TICS-J). We used 1,465 audio files from 99 healthy controls (HC) and 151 audio files from 24 AD patients, derived from a dementia prevention program conducted by Hachioji City, Tokyo, between March and May 2020. After extracting vocal features from each audio file, we developed machine learning models based on extreme gradient boosting (XGBoost), random forest (RF), and logistic regression (LR), treating each audio file as one observation. We evaluated the predictive performance of the developed models by plotting receiver operating characteristic (ROC) curves and calculating the areas under the curve (AUCs), sensitivity, and specificity. Further, we conducted classifications treating each participant as one observation, computing the average of their audio files’ predictive values, and compared the results with the predictive performance of the TICS-J-based questionnaire. Of 1,616 audio files in total, 1,308 (80.9%) were randomly allocated to the training data and 308 (19.1%) to the validation data. For audio-file-based prediction, the AUCs for XGBoost, RF, and LR were 0.863 (95% confidence interval [CI]: 0.794–0.931), 0.882 (95% CI: 0.840–0.924), and 0.893 (95% CI: 0.832–0.954), respectively. 
For participant-based prediction, the AUCs for XGBoost, RF, LR, and TICS-J were 1.000 (95% CI: 1.000–1.000), 1.000 (95% CI: 1.000–1.000), 0.972 (95% CI: 0.918–1.000), and 0.917 (95% CI: 0.918–1.000), respectively. The difference in predictive accuracy between XGBoost and TICS-J approached significance (p = 0.065). Our novel prediction model using the vocal features of daily conversations demonstrated the potential to be useful for AD risk assessment.
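The participant-level classification described above, in which each participant's score is the mean of their audio files' predicted probabilities, can be sketched in a few lines (a minimal illustration; the participant IDs and probabilities below are hypothetical, not the study's data):

```python
from collections import defaultdict

def participant_scores(file_predictions):
    """Average per-file predicted probabilities for each participant.

    file_predictions: iterable of (participant_id, probability) pairs,
    one entry per audio file. Returns {participant_id: mean probability}.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for pid, prob in file_predictions:
        sums[pid] += prob
        counts[pid] += 1
    return {pid: sums[pid] / counts[pid] for pid in sums}

# Hypothetical per-file probabilities for two participants.
preds = [("p1", 0.9), ("p1", 0.7), ("p2", 0.2), ("p2", 0.4)]
scores = participant_scores(preds)
```

A participant would then be classified as at-risk when their averaged score exceeds a chosen threshold.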

2020 ◽  
Vol 12 (23) ◽  
pp. 3925
Author(s):  
Ivan Pilaš ◽  
Mateo Gašparović ◽  
Alan Novkinić ◽  
Damir Klobučar

The presented study demonstrates a bi-sensor approach suitable for rapid, precise, and up-to-date mapping of forest canopy gaps over larger spatial extents. The approach uses Unmanned Aerial Vehicle (UAV) red, green, and blue (RGB) images of smaller areas for highly precise forest canopy mask creation. Sentinel-2 (S-2) was used as a scaling platform for transferring information from the UAV to the wider spatial extent. Various approaches to improving predictive performance were examined: (I) the highest R2 of a single satellite index was 0.57, (II) the highest R2 using multiple features obtained from a single-date S-2 image was 0.624, and (III) the highest R2 on the multitemporal set of S-2 images was 0.697. Satellite indices such as the Atmospherically Resistant Vegetation Index (ARVI), Infrared Percentage Vegetation Index (IPVI), Normalized Difference Index (NDI45), Pigment-Specific Simple Ratio Index (PSSRa), Modified Chlorophyll Absorption Ratio Index (MCARI), Color Index (CI), Redness Index (RI), and Normalized Difference Turbidity Index (NDTI) were the dominant predictors in most of the Machine Learning (ML) algorithms. The more complex ML algorithms, such as Support Vector Machines (SVM), Random Forest (RF), Stochastic Gradient Boosting (GBM), Extreme Gradient Boosting (XGBoost), and CatBoost, provided the best performance on the training set but exhibited weaker generalization capabilities. Therefore, the simpler and more robust Elastic Net (ENET) algorithm was chosen for the final map creation.
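The R2 values used to compare the predictor sets above are the ordinary coefficient of determination, which can be computed directly (a minimal sketch; the canopy-gap values are made up, not the study's data):

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1.0 - ss_res / ss_tot

# Illustrative canopy-gap fractions: observed vs. modeled.
obs = [0.10, 0.25, 0.40, 0.55]
mod = [0.12, 0.22, 0.43, 0.50]
r2 = r_squared(obs, mod)
```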


2020 ◽  
Vol 38 (15_suppl) ◽  
pp. e16801-e16801
Author(s):  
Daniel R Cherry ◽  
Qinyu Chen ◽  
James Don Murphy

e16801 Background: Pancreatic cancer has an insidious presentation, with four in five patients presenting with disease not amenable to potentially curative surgery. Efforts to screen patients for pancreatic cancer using population-wide strategies have proven ineffective. We applied a machine learning approach to create an early prediction model drawing on the content of patients’ electronic health records (EHRs). Methods: We used patient data from OptumLabs, which included de-identified data extracted from patient EHRs collected between 2009 and 2017. We identified patients diagnosed with pancreatic cancer at age 40 or later, whom we categorized into early-stage pancreatic cancer (ESPC; n = 3,322) and late-stage pancreatic cancer (LSPC; n = 25,908) groups. ESPC cases were matched to non-pancreatic-cancer controls at a ratio of 1:16 based on diagnosis year and geographic division, and the cohort was divided into training (70%) and test (30%) sets. The prediction model was built by applying an eXtreme Gradient Boosting machine learning algorithm to ESPC patients’ EHRs in the year preceding diagnosis, with features including patient demographics, procedure and clinical diagnosis codes, clinical notes, and medications. Model discrimination was assessed with sensitivity, specificity, positive predictive value (PPV), and area under the curve (AUC), with a score of 1.0 indicating perfect prediction. Results: The final AUC in the test set was 0.841, and the model included 583 features, of which 248 (42.5%) were physician note elements, 146 (25.0%) were procedure codes, 91 (15.6%) were diagnosis codes, 89 (15.3%) were medications, and 9 (1.5%) were demographic features. The most important features were history of pancreatic disorders (not diabetes or cancer), age, income, biliary tract disease, education level, obstructive jaundice, and abdominal pain. We evaluated model performance at varying classification thresholds. 
When applied to patients over 40, choosing a threshold with a sensitivity of 20% produced a specificity of 99.9% and a PPV of 2.5%. The model’s PPV increased with age; for patients over 80, the PPV was 8.0%. LSPC patients identified by the model would have been detected a median of 4 months before their actual diagnosis, with a quarter of these patients identified at least 14 months earlier. Conclusions: Using EHR data to identify early-stage pancreatic cancer patients shows promise. While widespread use of this approach on an unselected population would produce high rates of false positives, the technique could be employed among high-risk patients or paired with other screening tools.
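Evaluating performance at a chosen classification threshold, as done above, amounts to tabulating the confusion matrix and deriving sensitivity, specificity, and PPV (an illustrative sketch on toy labels and scores, not the study's data):

```python
def confusion_metrics(labels, scores, threshold):
    """Sensitivity, specificity, and PPV at a given probability threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, specificity, ppv

# Toy data: 1 = cancer case, 0 = control.
labels = [1, 1, 0, 0, 0, 0]
scores = [0.9, 0.3, 0.2, 0.1, 0.6, 0.05]
sens, spec, ppv = confusion_metrics(labels, scores, 0.5)
```

Sweeping the threshold trades sensitivity against specificity and PPV, which is the trade-off the abstract describes.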


Author(s):  
Samir Bandyopadhyay ◽  
Shawni Dutta ◽  
Upasana Mukherjee

The novel coronavirus disease (COVID-19) has created immense threats to public health on various levels around the globe. The unpredictable outbreak of this disease and the resulting pandemic are causing severe depression, anxiety, and other mental and physical health problems among human beings. To combat this disease, vaccination is essential, as it boosts the immune system of people who come into contact with infected individuals. The vaccination process is thus necessary to confront the outbreak of COVID-19. This deadly disease has posed an enormous challenge to the social and economic conditions of the entire world. Worldwide vaccination progress should be tracked to identify how quickly economic and social life will stabilize. To monitor the vaccination progress, a machine learning-based regressor model is proposed in this study. The tracking process was applied to data from 14 December 2020 to 24 April 2021. Several ensemble-based machine learning regressor models, namely Random Forest, Extra Trees, Gradient Boosting, AdaBoost, and Extreme Gradient Boosting, were implemented and their predictive performances compared. The comparative study reveals that the AdaBoost regressor outperforms the others, with a minimized mean absolute error (MAE) of 9.968 and root mean squared error (RMSE) of 11.133.
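The MAE and RMSE used above to rank the regressors are standard error metrics; a minimal sketch (the vaccination counts are made up, not the study's data):

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error: average of |true - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt of the average squared residual."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Illustrative daily vaccination counts (true vs. predicted, in millions).
actual = [10.0, 12.0, 15.0]
predicted = [11.0, 11.0, 17.0]
err_mae, err_rmse = mae(actual, predicted), rmse(actual, predicted)
```

RMSE penalizes large residuals more heavily than MAE, which is why the two metrics can rank models differently.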


2021 ◽  
Vol 13 (23) ◽  
pp. 4832
Author(s):  
Patrick Schratz ◽  
Jannes Muenchow ◽  
Eugenia Iturritxa ◽  
José Cortés ◽  
Bernd Bischl ◽  
...  

This study analyzed highly correlated, feature-rich datasets from hyperspectral remote sensing data using multiple statistical and machine-learning methods. The effect of filter-based feature selection methods on predictive performance was compared. In addition, the effect of multiple expert-based and data-driven feature sets, derived from the reflectance data, was investigated. Defoliation of trees (%), derived from in situ measurements from fall 2016, was modeled as a function of reflectance. Variable importance was assessed using permutation-based feature importance. Overall, the support vector machine (SVM) outperformed other algorithms, such as random forest (RF), extreme gradient boosting (XGBoost), and lasso (L1) and ridge (L2) regression, by at least three percentage points. The combination of certain feature sets showed small increases in predictive performance, while no substantial differences between individual feature sets were observed. For some combinations of learners and feature sets, filter methods achieved better predictive performance than using no feature selection. Ensemble filters did not have a substantial impact on performance. The most important features were located around the red edge. Additional features in the near-infrared region (800–1000 nm) were also essential to achieving the overall best performance. Filter methods have the potential to be helpful in high-dimensional situations and can improve the interpretation of feature effects in fitted models, which is an essential requirement in environmental modeling studies. Nevertheless, more training data and replication in similar benchmarking studies are needed to generalize the results.
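Permutation-based feature importance, as used here, measures the drop in a model's score when one feature column is shuffled. A minimal sketch with a toy scoring function (the model and data are hypothetical stand-ins, not the study's):

```python
import random

def permutation_importance(score_fn, X, y, n_features, seed=0):
    """Drop in score after shuffling each feature column in turn.

    score_fn(X, y) -> scalar score (higher = better) for an already-fitted model.
    """
    rng = random.Random(seed)
    baseline = score_fn(X, y)
    importances = []
    for j in range(n_features):
        col = [row[j] for row in X]
        rng.shuffle(col)
        X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
        importances.append(baseline - score_fn(X_perm, y))
    return importances

# Toy "fitted model": predicts y from feature 0 only; score is negative MSE.
def score_fn(X, y):
    return -sum((row[0] - t) ** 2 for row, t in zip(X, y)) / len(y)

X = [[1.0, 5.0], [2.0, 5.0], [3.0, 5.0], [4.0, 5.0]]
y = [1.0, 2.0, 3.0, 4.0]
imp = permutation_importance(score_fn, X, y, n_features=2)
# Feature 0 drives the predictions; feature 1 (constant) should score zero.
```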


Mathematics ◽  
2020 ◽  
Vol 8 (9) ◽  
pp. 1590
Author(s):  
Muhammad Syafrudin ◽  
Ganjar Alfian ◽  
Norma Latif Fitriyani ◽  
Muhammad Anshari ◽  
Tony Hadibarata ◽  
...  

Detecting self-care problems is one of the most important and challenging issues for occupational therapists, since it requires a complex and time-consuming process. Machine learning algorithms have recently been applied to overcome this issue. In this study, we propose a self-care prediction model called GA-XGBoost, which combines genetic algorithms (GAs) with extreme gradient boosting (XGBoost) for predicting the self-care problems of children with disabilities. Because the selected feature subset affects model performance, we utilize a GA to find the optimal feature subset and thereby improve the model’s performance. To validate the effectiveness of GA-XGBoost, we present six experiments: comparisons of GA-XGBoost with other machine learning models and with previous study results, a statistical significance test, an impact analysis of feature selection, a comparison with other feature selection methods, and a sensitivity analysis of the GA parameters. During the experiments, we use accuracy, precision, recall, and F1-score to measure the performance of the prediction models. The results show that GA-XGBoost performs better than the other prediction models and the previous study results. In addition, we designed and developed a web-based self-care prediction application to help therapists diagnose the self-care problems of children with disabilities, so that appropriate treatment/therapy can be performed for each child to improve their therapeutic outcome.
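A genetic algorithm for feature-subset selection of the kind GA-XGBoost describes can be sketched as evolution over binary feature masks. The fitness function below is a hypothetical stand-in (the paper's fitness would be an XGBoost cross-validation score):

```python
import random

def ga_feature_selection(fitness, n_features, pop_size=20, generations=30,
                         mutation_rate=0.1, seed=42):
    """Tiny genetic algorithm over binary feature masks (1 = keep feature)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(pop, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]            # truncation selection + elitism
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)       # one-point crossover
            child = a[:cut] + b[cut:]
            for i in range(n_features):              # bit-flip mutation
                if rng.random() < mutation_rate:
                    child[i] = 1 - child[i]
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

# Hypothetical fitness: features 0 and 2 are informative; each kept feature costs 0.05.
def fitness(mask):
    return mask[0] + mask[2] - 0.05 * sum(mask)

best = ga_feature_selection(fitness, n_features=6)
```

Elitism (carrying the top half forward unchanged) guarantees the best mask found so far is never lost between generations.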


2020 ◽  
Vol 8 ◽  
Author(s):  
Yasutaka Kuniyoshi ◽  
Haruka Tokutake ◽  
Natsuki Takahashi ◽  
Azusa Kamura ◽  
Sumie Yasuda ◽  
...  

We constructed an optimal machine learning (ML) method for predicting intravenous immunoglobulin (IVIG) resistance in children with Kawasaki disease (KD) using commonly available clinical and laboratory variables. We retrospectively collected 98 clinical records of hospitalized children with KD (2–109 months of age). We found that 20 (20%) children were resistant to initial IVIG therapy. We trained three ML techniques, namely logistic regression, linear support vector machine, and eXtreme gradient boosting, with 10 variables against IVIG resistance, and estimated predictive performance based on nested 5-fold cross-validation (CV). We also selected variables using the recursive feature elimination method and performed the nested 5-fold CV with the selected variables in a similar manner. We then compared the ML models with existing scoring systems with regard to their predictive performance. The area under the receiver operating characteristic curve was in the range of 0.58–0.60 for the all-variable models and 0.60–0.75 for the selected-variable models. The specificities were above 0.90 and higher than those of existing scoring systems, but the sensitivities were lower. The three ML models based on demographics and routine laboratory variables did not provide reliable performance. This is possibly the first study to attempt to establish a better predictive model. Additional biomarkers are probably needed to generate an effective prediction model.
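Nested 5-fold CV, as used above, separates hyperparameter tuning (inner loop) from performance estimation (outer loop). A minimal sketch that only enumerates the splits for the study's 98 records; actual model fitting and scoring would plug in where noted:

```python
def k_folds(n, k):
    """Yield (train_idx, test_idx) pairs for k-fold CV over n samples."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        test_set = set(test)
        train = [i for i in range(n) if i not in test_set]
        yield train, test
        start += size

def nested_cv(n, outer_k=5, inner_k=5):
    """Enumerate outer/inner splits; returns (train size, test size) per outer fold."""
    splits = []
    for outer_train, outer_test in k_folds(n, outer_k):
        for inner_train, inner_val in k_folds(len(outer_train), inner_k):
            # Inner loop: tune hyperparameters using outer_train data only.
            pass
        # Outer loop: refit with the best hyperparameters, score on outer_test.
        splits.append((len(outer_train), len(outer_test)))
    return splits

splits = nested_cv(98)  # 98 clinical records, as in the study
```

Because the outer test fold never participates in tuning, the outer-loop score is an unbiased estimate of generalization performance.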


Author(s):  
Zhipeng Zhang ◽  
Kang Zhou ◽  
Xiang Liu

Broken rails are the most frequent cause of freight train derailments in the United States. According to the U.S. Federal Railroad Administration (FRA) railroad accident database, there were over 900 Class I railroad freight-train derailments caused by broken rails between 2000 and 2017. In 2017 alone, broken-rail-caused freight train derailments caused $15.8 million in track and rolling stock damage to Class I railroads. The prevention of broken rails is crucial for reducing the risk of broken-rail-caused derailments. Although big data is growing rapidly in the railroad industry, little prior research has taken advantage of these data to uncover the relationship between real-world factors and broken rail occurrence. This article aims to predict the occurrence of broken rails via a machine learning approach that simultaneously accounts for track files, traffic information, maintenance history, and prior defect information. For the prediction of broken rails, a machine learning-based algorithm called extreme gradient boosting (XGBoost) is developed with various types of variables, including track characteristics (e.g. rail profile information, rail laid information), traffic-related information (e.g. gross tonnage recorded over time, number of passing cars), maintenance records (e.g. rail grinding and track ballast cleaning), and historical rail defect records. Area under the curve (AUC) is used as the evaluation metric for the prediction accuracy of the developed machine learning model. The preliminary results show that the AUC of the one-year XGBoost-based prediction model is 0.83, higher than that of two comparative models, logistic regression and random forest. Furthermore, the feature importance analysis reveals that segment length, traffic tonnage, number of car passes, rail age, and the number of detected defects in the past six months have relatively greater importance for the prediction of broken rails. 
The prediction model and its outcomes, along with future research on the relationship between broken rails and broken-rail-caused derailments, can benefit practical railroad maintenance planning and capital planning.
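The AUC used as the evaluation metric above equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one (ties counted as half). A minimal pairwise sketch on toy data:

```python
def auc(labels, scores):
    """AUC via the Mann-Whitney U statistic: fraction of positive/negative
    pairs where the positive outranks the negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy segment labels (1 = broken rail observed) and model risk scores.
labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.1]
value = auc(labels, scores)
```

This pairwise form is O(P x N); production implementations use a rank-based computation instead, but the result is identical.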


Diagnostics ◽  
2021 ◽  
Vol 11 (6) ◽  
pp. 943
Author(s):  
Joung Ouk (Ryan) Kim ◽  
Yong-Suk Jeong ◽  
Jin Ho Kim ◽  
Jong-Weon Lee ◽  
Dougho Park ◽  
...  

Background: This study proposes a cardiovascular disease (CVD) prediction model using machine learning (ML) algorithms based on the National Health Insurance Service-Health Screening datasets. Methods: We extracted 4,699 patients aged over 45 as the CVD group, diagnosed according to the International Classification of Diseases system (I20–I25). In addition, 4,699 random subjects without a CVD diagnosis were enrolled as a non-CVD group. Both groups were matched by age and gender. Various ML algorithms were applied to perform CVD prediction, and the performances of all the prediction models were compared. Results: The extreme gradient boosting, gradient boosting, and random forest algorithms exhibited the best average prediction accuracy (area under the receiver operating characteristic curve (AUROC): 0.812, 0.812, and 0.811, respectively) among all algorithms validated in this study. Based on AUROC, the ML algorithms improved CVD prediction performance compared to previously proposed prediction models. Preexisting CVD history was the most important factor contributing to the accuracy of the prediction model, followed by total cholesterol, low-density lipoprotein cholesterol, waist-height ratio, and body mass index. Conclusions: Our results indicate that the proposed health screening dataset-based CVD prediction model using ML algorithms is readily applicable, produces validated results, and outperforms previous CVD prediction models.
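Matching the non-CVD group to cases by age and gender, as described above, can be sketched as greedy exact matching on (age, gender) keys (the records below are hypothetical, and the study's matching procedure may differ in detail):

```python
from collections import defaultdict

def match_controls(cases, pool):
    """Greedy 1:1 matching of controls to cases on exact (age, gender)."""
    buckets = defaultdict(list)
    for person in pool:
        buckets[(person["age"], person["gender"])].append(person)
    matched = []
    for case in cases:
        key = (case["age"], case["gender"])
        if buckets[key]:                      # take an unused control, if any
            matched.append(buckets[key].pop())
    return matched

# Hypothetical case and control-pool records.
cases = [{"age": 50, "gender": "F"}, {"age": 62, "gender": "M"}]
pool = [{"age": 50, "gender": "F"}, {"age": 62, "gender": "M"},
        {"age": 45, "gender": "M"}]
controls = match_controls(cases, pool)
```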


2020 ◽  
Vol 13 (S10) ◽  
Author(s):  
Minghui Liu ◽  
Jingyi Yang ◽  
Jiacheng Wang ◽  
Lei Deng

Background: Studies have found that miRNAs play an important role in many biological activities involved in human diseases. Revealing the associations between miRNAs and diseases through biological experiments is time-consuming and expensive, and computational approaches provide an alternative. However, because of the limited knowledge of miRNA-disease associations, it is difficult to train an effective prediction model. Methods: In this work, we propose a model to predict miRNA-disease associations, MDAPCOM, in which protein information associated with miRNAs and diseases is introduced to build a global miRNA-protein-disease network. Subsequently, diffusion features and HeteSim features, extracted from the global network, are combined to train the prediction model by eXtreme Gradient Boosting (XGBoost). Results: The MDAPCOM model achieves an AUC of 0.991 based on 10-fold cross-validation, significantly better than that of two other state-of-the-art methods, RWRMDA and PRINCE. Furthermore, the model performs well on three unbalanced datasets. Conclusions: The results suggest that the information carried by proteins associated with miRNAs and diseases is crucial to predicting miRNA-disease associations, and that hybrid feature representation in the heterogeneous network is very effective for improving predictive performance.
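Diffusion features of the kind extracted from a global network are commonly computed with a random walk with restart; a minimal sketch on a toy symmetric network (the adjacency matrix is illustrative, and MDAPCOM's exact diffusion procedure may differ):

```python
def rwr(adj, seed_idx, restart=0.5, iters=100):
    """Random walk with restart on an undirected graph; returns the
    converged visiting-probability vector (the node's diffusion profile)."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    p0 = [1.0 if i == seed_idx else 0.0 for i in range(n)]
    p = p0[:]
    for _ in range(iters):
        # One walk step: mass at node j spreads uniformly to its neighbors.
        spread = [sum(adj[j][i] / deg[j] * p[j] for j in range(n) if deg[j])
                  for i in range(n)]
        p = [(1 - restart) * spread[i] + restart * p0[i] for i in range(n)]
    return p

# Toy miRNA-protein-disease network: a 4-node path, as a symmetric adjacency matrix.
adj = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
profile = rwr(adj, seed_idx=0)
```

Concatenating each node's diffusion profile (one per seed) yields a feature vector that encodes network proximity, which can then feed a classifier such as XGBoost.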


Author(s):  
Tianhang Chen ◽  
Xiangeng Wang ◽  
Yanyi Chu ◽  
Dong-Qing Wei ◽  
Yi Xiong

Type IV secreted effectors (T4SEs) can be translocated into the cytosol of host cells via the type IV secretion system (T4SS) and cause diseases. However, experimental approaches to identify T4SEs are time- and resource-consuming, and existing computational tools based on machine learning techniques have obvious limitations, such as a lack of interpretability in the prediction models. In this study, we propose a new model, T4SE-XGB, which uses the eXtreme gradient boosting (XGBoost) algorithm for accurate identification of type IV effectors based on optimal features derived from protein sequences. After trying 20 different types of features, the best performance was achieved when all features were fed into XGBoost, as assessed by 5-fold cross-validation in comparison with other machine learning methods. The ReliefF algorithm was then adopted to obtain the optimal feature set on our dataset, which further improved the model’s performance. T4SE-XGB exhibited the highest predictive performance on the independent test set and outperformed other published prediction tools. Furthermore, the SHAP method was used to interpret the contribution of features to model predictions. The identification of key features can contribute to an improved understanding of the multifactorial contributors to host-pathogen interactions and bacterial pathogenesis. Beyond type IV effector prediction, we believe the proposed framework can provide instructive guidance for constructing prediction methods for related biological problems. The data and source code of this study can be freely accessed at https://github.com/CT001002/T4SE-XGB.
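The Relief family of algorithms (of which ReliefF is the multi-class, k-neighbor extension used here) weights features by how well they separate each instance from its nearest miss versus its nearest hit. A minimal single-neighbor Relief sketch on toy data, not the paper's implementation:

```python
def relief(X, y):
    """Basic Relief weights for a binary-labeled dataset: reward features
    that differ from the nearest miss, penalize features that differ from
    the nearest hit. Distances are Manhattan over all features."""
    n, d = len(X), len(X[0])
    w = [0.0] * d

    def dist(a, b):
        return sum(abs(ai - bi) for ai, bi in zip(a, b))

    for i in range(n):
        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        nh = min(hits, key=lambda j: dist(X[i], X[j]))    # nearest hit
        nm = min(misses, key=lambda j: dist(X[i], X[j]))  # nearest miss
        for f in range(d):
            w[f] += abs(X[i][f] - X[nm][f]) - abs(X[i][f] - X[nh][f])
    return w

# Toy data: feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 0.3], [0.1, 0.7], [1.0, 0.5], [0.9, 0.1]]
y = [0, 0, 1, 1]
w = relief(X, y)
# Feature 0 should receive a positive weight, feature 1 a negative one.
```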

