An Interpretable Early Dynamic Sequential Predictor for Sepsis-Induced Coagulopathy Progression in the Real-World Using Machine Learning

Sepsis-associated coagulation dysfunction greatly increases the mortality of sepsis. Irregular clinical time-series data remains a major challenge for AI medical applications. To early detect and manage sepsis-induced coagulopathy (SIC) and sepsis-associated disseminated intravascular coagulation (DIC), we developed an interpretable real-time sequential warning model toward real-world irregular data. Eight machine learning models including novel algorithms were devised to detect SIC and sepsis-associated DIC 8n (1 ≤ n ≤ 6) hours prior to its onset. Models were developed on Xi'an Jiaotong University Medical College (XJTUMC) and verified on Beth Israel Deaconess Medical Center (BIDMC). A total of 12,154 SIC and 7,878 International Society on Thrombosis and Haemostasis (ISTH) overt-DIC labels were annotated according to the SIC and ISTH overt-DIC scoring systems in train set. The area under the receiver operating characteristic curve (AUROC) were used as model evaluation metrics. The eXtreme Gradient Boosting (XGBoost) model can predict SIC and sepsis-associated DIC events up to 48 h earlier with an AUROC of 0.929 and 0.910, respectively, and even reached 0.973 and 0.955 at 8 h earlier, achieving the highest performance to date. The novel ODE-RNN model achieved continuous prediction at arbitrary time points, and with an AUROC of 0.962 and 0.936 for SIC and DIC predicted 8 h earlier, respectively. In conclusion, our model can predict the sepsis-associated SIC and DIC onset up to 48 h in advance, which helps maximize the time window for early management by physicians.

Download Full-text

Highly accurate and explainable detection of specimen mix-up using a machine learning model

Clinical Chemistry and Laboratory Medicine (CCLM) ◽

10.1515/cclm-2019-0534 ◽

2020 ◽

Vol 58 (3) ◽

pp. 375-383 ◽

Cited By ~ 2

Author(s):

Tomohiro Mitani ◽

Shunsuke Doi ◽

Shinichiroh Yokota ◽

Takeshi Imai ◽

Kazuhiko Ohe

Keyword(s):

Machine Learning ◽

Time Series ◽

Time Series Data ◽

Characteristic Curve ◽

Window Size ◽

Series Data ◽

Gradient Boosting ◽

Time Range ◽

Biochemical Tests ◽

Cell Counts

AbstractBackgroundDelta check is widely used for detecting specimen mix-ups. Owing to the inadequate specificity and sparseness of the absolute incidence of mix-ups, the positive predictive value (PPV) of delta check is considerably low as it is labor consuming to identify true mix-up errors among a large number of false alerts. To overcome this problem, we developed a new accurate detection model through machine learning.MethodsInspired by delta check, we decided to conduct comparisons with the past examinations and broaden the time range. Fifteen common items were selected from complete blood cell counts and biochemical tests. We considered examinations in which ≥11 among the 15 items were measured simultaneously in our hospital; we created individual partial time-series data of the consecutive examinations with a sliding window size of 4. The last examinations of the partial time-series data were shuffled to generate artificial mix-up cases. After splitting the dataset into development and validation sets, we allowed a gradient-boosting-decision-tree (GBDT) model to learn using the development set to detect whether the last examination results of the partial time-series data were artificial mixed-up results. The model’s performance was evaluated on the validation set.ResultsThe area under the receiver operating characteristic curve (ROC AUC) of our model was 0.9983 (bootstrap confidence interval [bsCI]: 0.9983–0.9985).ConclusionsThe GBDT model was more effective in detecting specimen mix-up. The improved accuracy will enable more facilities to perform more efficient and centralized mix-up detection, leading to improved patient safety.

Download Full-text

Prediction and Analysis of Gold Prices using Ensemble Machine Learning Algorithms

International Journal for Research in Applied Science and Engineering Technology ◽

10.22214/ijraset.2021.36028 ◽

2021 ◽

Vol 9 (VI) ◽

pp. 4367-4374

Author(s):

Gudipally Chandrashakar

Keyword(s):

Machine Learning ◽

Time Series ◽

Time Series Data ◽

Gold Price ◽

Machine Learning Algorithms ◽

Series Data ◽

Gradient Boosting ◽

Support Vector ◽

Average Value ◽

Ensemble Machine Learning

In this article, we used historical time series data up to the current day gold price. In this study of predicting gold price, we consider few correlating factors like silver price, copper price, standard, and poor’s 500 value, dollar-rupee exchange rate, Dow Jones Industrial Average Value. Considering the prices of every correlating factor and gold price data where dates ranging from 2008 January to 2021 February. Few algorithms of machine learning are used to analyze the time-series data are Random Forest Regression, Support Vector Regressor, Linear Regressor, ExtraTrees Regressor and Gradient boosting Regression. While seeing the results the Extra Tree Regressor algorithm gives the predicted value of gold prices more accurately.

Download Full-text

Analytical Derivation of Nonlinear Spectral Effects and 1/f Scaling Artifact in Signal Processing of Real-World Data

Neural Computation ◽

10.1162/neco_a_00979 ◽

2017 ◽

Vol 29 (7) ◽

pp. 2004-2020 ◽

Cited By ~ 1

Author(s):

Claudia Lainscsek ◽

Lyle E. Muller ◽

Aaron L. Sampson ◽

Terrence J. Sejnowski

Keyword(s):

Signal Processing ◽

Real World ◽

Time Series Data ◽

Time Window ◽

Low Frequency ◽

Series Data ◽

Infinite Length ◽

Scaling Exponent ◽

Sampled Data ◽

Frequency Components

In estimating the frequency spectrum of real-world time series data, we must violate the assumption of infinite-length, orthogonal components in the Fourier basis. While it is widely known that care must be taken with discretely sampled data to avoid aliasing of high frequencies, less attention is given to the influence of low frequencies with period below the sampling time window. Here, we derive an analytic expression for the side-lobe attenuation of signal components in the frequency domain representation. This expression allows us to detail the influence of individual frequency components throughout the spectrum. The first consequence is that the presence of low-frequency components introduces a 1/f[Formula: see text] component across the power spectrum, with a scaling exponent of [Formula: see text]. This scaling artifact could be composed of diffuse low-frequency components, which can render it difficult to detect a priori. Further, treatment of the signal with standard digital signal processing techniques cannot easily remove this scaling component. While several theoretical models have been introduced to explain the ubiquitous 1/f[Formula: see text] scaling component in neuroscientific data, we conjecture here that some experimental observations could be the result of such data analysis procedures.

Download Full-text

The prediction of asymptomatic carotid atherosclerosis with electronic health records: a comparative study of six machine learning models

BMC Medical Informatics and Decision Making ◽

10.1186/s12911-021-01480-3 ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Jiaxin Fan ◽

Mengying Chen ◽

Jian Luo ◽

Shusen Yang ◽

Jinming Shi ◽

...

Keyword(s):

Machine Learning ◽

Electronic Health Records ◽

Carotid Atherosclerosis ◽

Characteristic Curve ◽

Gradient Boosting ◽

Learning Models ◽

Health Records ◽

Extreme Gradient Boosting ◽

Electronic Health ◽

Machine Learning Models

Abstract Background Screening carotid B-mode ultrasonography is a frequently used method to detect subjects with carotid atherosclerosis (CAS). Due to the asymptomatic progression of most CAS patients, early identification is challenging for clinicians, and it may trigger ischemic stroke. Recently, machine learning has shown a strong ability to classify data and a potential for prediction in the medical field. The combined use of machine learning and the electronic health records of patients could provide clinicians with a more convenient and precise method to identify asymptomatic CAS. Methods Retrospective cohort study using routine clinical data of medical check-up subjects from April 19, 2010 to November 15, 2019. Six machine learning models (logistic regression [LR], random forest [RF], decision tree [DT], eXtreme Gradient Boosting [XGB], Gaussian Naïve Bayes [GNB], and K-Nearest Neighbour [KNN]) were used to predict asymptomatic CAS and compared their predictability in terms of the area under the receiver operating characteristic curve (AUCROC), accuracy (ACC), and F1 score (F1). Results Of the 18,441 subjects, 6553 were diagnosed with asymptomatic CAS. Compared to DT (AUCROC 0.628, ACC 65.4%, and F1 52.5%), the other five models improved prediction: KNN + 7.6% (0.704, 68.8%, and 50.9%, respectively), GNB + 12.5% (0.753, 67.0%, and 46.8%, respectively), XGB + 16.0% (0.788, 73.4%, and 55.7%, respectively), RF + 16.6% (0.794, 74.5%, and 56.8%, respectively) and LR + 18.1% (0.809, 74.7%, and 59.9%, respectively). The highest achieving model, LR predicted 1045/1966 cases (sensitivity 53.2%) and 3088/3566 non-cases (specificity 86.6%). A tenfold cross-validation scheme further verified the predictive ability of the LR. Conclusions Among machine learning models, LR showed optimal performance in predicting asymptomatic CAS. Our findings set the stage for an early automatic alarming system, allowing a more precise allocation of CAS prevention measures to individuals probably to benefit most.

Download Full-text

A Comparative Analysis of Machine Learning Models for Prediction of Insurance Uptake in Kenya

10.20944/preprints202010.0186.v1 ◽

2020 ◽

Author(s):

Nelson Yego ◽

Juma Kasozi ◽

Joseph Nkrunziza

Keyword(s):

Machine Learning ◽

Random Forest ◽

Characteristic Curve ◽

Confusion Matrix ◽

Gradient Boosting ◽

Support Vector ◽

Sampled Data ◽

Learning Models ◽

Extreme Gradient Boosting ◽

Machine Learning Models

The role of insurance in financial inclusion as well as in economic growth is immense. However, low uptake seems to impede the growth of the sector hence the need for a model that robustly predicts uptake of insurance among potential clients. In this research, we compared the performances of eight (8) machine learning models in predicting the uptake of insurance. The classifiers considered were Logistic Regression, Gaussian Naive Bayes, Support Vector Machines, K Nearest Neighbors, Decision Tree, Random Forest, Gradient Boosting Machines and Extreme Gradient boosting. The data used in the classification was from the 2016 Kenya FinAccess Household Survey. Comparison of performance was done for both upsampled and downsampled data due to data imbalance. For upsampled data, Random Forest classifier showed highest accuracy and precision compared to other classifiers but for down sampled data, gradient boosting was optimal. It is noteworthy that for both upsampled and downsampled data, tree-based classifiers were more robust than others in insurance uptake prediction. However, in spite of hyper-parameter optimization, the area under receiver operating characteristic curve remained highest for Random Forest as compared to other tree-based models. Also, the confusion matrix for Random Forest showed least false positives, and highest true positives hence could be construed as the most robust model for predicting the insurance uptake. Finally, the most important feature in predicting uptake was having a bank product hence bancassurance could be said to be a plausible channel of distribution of insurance products.

Download Full-text

Prediction of Chlorophyll-a Concentrations in the Nakdong River Using Machine Learning Methods

Water ◽

10.3390/w12061822 ◽

2020 ◽

Vol 12 (6) ◽

pp. 1822

Author(s):

Yuna Shin ◽

Taekgeun Kim ◽

Seoksu Hong ◽

Seulbi Lee ◽

EunJi Lee ◽

...

Keyword(s):

Machine Learning ◽

Chlorophyll A ◽

Series Data ◽

Gradient Boosting ◽

Support Vector ◽

Nakdong River ◽

Extreme Gradient Boosting ◽

Rolling Window ◽

The Nakdong River ◽

Recursive Prediction

Many studies have attempted to predict chlorophyll-a concentrations using multiple regression models and validating them with a hold-out technique. In this study commonly used machine learning models, such as Support Vector Regression, Bagging, Random Forest, Extreme Gradient Boosting (XGBoost), Recurrent Neural Network (RNN), and Long–Short-Term Memory (LSTM), are used to build a new model to predict chlorophyll-a concentrations in the Nakdong River, Korea. We employed 1–step ahead recursive prediction to reflect the characteristics of the time series data. In order to increase the prediction accuracy, the model construction was based on forward variable selection. The fitted models were validated by means of cumulative learning and rolling window learning, as opposed to the hold–out technique. The best results were obtained when the chlorophyll-a concentration was predicted by combining the RNN model with the rolling window learning method. The results suggest that the selection of explanatory variables and 1–step ahead recursive prediction in the machine learning model are important processes for improving its prediction performance.

Download Full-text

MRI Radiomic Features to Predict IDH1 Mutation Status in Gliomas: A Machine Learning Approach using Gradient Tree Boosting

International Journal of Molecular Sciences ◽

10.3390/ijms21218004 ◽

2020 ◽

Vol 21 (21) ◽

pp. 8004

Author(s):

Yu Sakai ◽

Chen Yang ◽

Shingo Kihira ◽

Nadejda Tsankova ◽

Fahad Khan ◽

...

Keyword(s):

Machine Learning ◽

Characteristic Curve ◽

Area Under The Curve ◽

Prognostic Indicator ◽

Idh1 Mutation ◽

Gradient Boosting ◽

Isocitrate Dehydrogenase 1 ◽

Test Set ◽

Mutation Status ◽

Extreme Gradient Boosting

In patients with gliomas, isocitrate dehydrogenase 1 (IDH1) mutation status has been studied as a prognostic indicator. Recent advances in machine learning (ML) have demonstrated promise in utilizing radiomic features to study disease processes in the brain. We investigate whether ML analysis of multiparametric radiomic features from preoperative Magnetic Resonance Imaging (MRI) can predict IDH1 mutation status in patients with glioma. This retrospective study included patients with glioma with known IDH1 status and preoperative MRI. Radiomic features were extracted from Fluid-Attenuated Inversion Recovery (FLAIR) and Diffusion-Weighted-Imaging (DWI). The dataset was split into training, validation, and testing sets by stratified sampling. Synthetic Minority Oversampling Technique (SMOTE) was applied to the training sets. eXtreme Gradient Boosting (XGBoost) classifiers were trained, and the hyperparameters were tuned. Receiver operating characteristic curve (ROC), accuracy, and f1-scores were collected. A total of 100 patients (age: 55 ± 15, M/F 60/40); with IDH1 mutant (n = 22) and IDH1 wildtype (n = 78) were included. The best performance was seen with a DWI-trained XGBoost model, which achieved ROC with Area Under the Curve (AUC) of 0.97, accuracy of 0.90, and f1-score of 0.75 on the test set. The FLAIR-trained XGBoost model achieved ROC with AUC of 0.95, accuracy of 0.90, f1-score of 0.75 on the test set. A model that was trained on combined FLAIR-DWI radiomic features did not provide incremental accuracy. The results show that a XGBoost classifier using multiparametric radiomic features derived from preoperative MRI can predict IDH1 mutation status with > 90% accuracy.

Download Full-text

Machine learning identifies girls with central precocious puberty based on multisource data

JAMIA Open ◽

10.1093/jamiaopen/ooaa063 ◽

2020 ◽

Author(s):

Liyan Pan ◽

Guangjian Liu ◽

Xiaojian Mao ◽

Huiying Liang

Keyword(s):

Machine Learning ◽

Precocious Puberty ◽

Laboratory Tests ◽

Medical Center ◽

Central Precocious Puberty ◽

Stimulation Test ◽

Youden Index ◽

Gradient Boosting ◽

Single Source ◽

Extreme Gradient Boosting

Abstract Objective The study aimed to develop simplified diagnostic models for identifying girls with central precocious puberty (CPP), without the expensive and cumbersome gonadotropin-releasing hormone (GnRH) stimulation test, which is the gold standard for CPP diagnosis. Materials and methods Female patients who had secondary sexual characteristics before 8 years old and had taken a GnRH analog (GnRHa) stimulation test at a medical center in Guangzhou, China were enrolled. Data from clinical visiting, laboratory tests, and medical image examinations were collected. We first extracted features from unstructured data such as clinical reports and medical images. Then, models based on each single-source data or multisource data were developed with Extreme Gradient Boosting (XGBoost) classifier to classify patients as CPP or non-CPP. Results The best performance achieved an area under the curve (AUC) of 0.88 and Youden index of 0.64 in the model based on multisource data. The performance of single-source models based on data from basal laboratory tests and the feature importance of each variable showed that the basal hormone test had the highest diagnostic value for a CPP diagnosis. Conclusion We developed three simplified models that use easily accessed clinical data before the GnRH stimulation test to identify girls who are at high risk of CPP. These models are tailored to the needs of patients in different clinical settings. Machine learning technologies and multisource data fusion can help to make a better diagnosis than traditional methods.

Download Full-text

Airport Arrival Flow Prediction considering Meteorological Factors Based on Deep-Learning Methods

Complexity ◽

10.1155/2020/6309272 ◽

2020 ◽

Vol 2020 ◽

pp. 1-11

Author(s):

Zhao Yang ◽

Yifan Wang ◽

Jie Li ◽

Liming Liu ◽

Jiyang Ma ◽

...

Keyword(s):

Neural Network ◽

Time Series Data ◽

Short Term Memory ◽

Series Data ◽

Gradient Boosting ◽

Support Vector ◽

Model Parameters ◽

Neural Network Models ◽

Flow Prediction ◽

Extreme Gradient Boosting

This study presents a combined Long Short-Term Memory and Extreme Gradient Boosting (LSTM-XGBoost) method for flight arrival flow prediction at the airport. Correlation analysis is conducted between the historic arrival flow and input features. The XGBoost method is applied to identify the relative importance of various variables. The historic time-series data of airport arrival flow and selected features are taken as input variables, and the subsequent flight arrival flow is the output variable. The model parameters are sequentially updated based on the recently collected data and the new predicting results. It is found that the prediction accuracy is greatly improved by incorporating the meteorological features. The data analysis results indicate that the developed method can characterize well the dynamics of the airport arrival flow, thereby providing satisfactory prediction results. The prediction performance is compared with benchmark methods including backpropagation neural network, LSTM neural network, support vector machine, gradient boosting regression tree, and XGBoost. The results show that the proposed LSTM-XGBoost model outperforms baseline and state-of-the-art neural network models.

Download Full-text

A Novel Machine Learning Strategy for Prediction of Antihypertensive Peptides Derived from Food with High Efficiency

10.1101/2020.08.12.248955 ◽

2020 ◽

Author(s):

Liyang Wang ◽

Dantong Niu ◽

Xiaoya Wang ◽

Qun Shen ◽

Yong Xue

Keyword(s):

Machine Learning ◽

High Throughput ◽

High Efficiency ◽

Characteristic Curve ◽

Bovine Milk ◽

Structural Features ◽

Protein Docking ◽

Gradient Boosting ◽

Extreme Gradient Boosting ◽

Antihypertensive Peptides

AbstractStrategies to screen antihypertensive peptides with high throughput and rapid speed will be doubtlessly contributed to the treatment of hypertension. The food-derived antihypertensive peptides can reduce blood pressure without side effects. In present study, a novel model based on Extreme Gradient Boosting (XGBoost) algorithm was developed using the primary structural features of the food-derived peptides, and its performance in the prediction of antihypertensive peptides was compared with the dominating machine learning models. To further reflect the reliability of the method in real situation, the optimized XGBoost model was utilized to predict the antihypertensive degree of k-mer peptides cutting from 6 key proteins in bovine milk and the peptide-protein docking technology was introduced to verify the findings. The results showed that the XGBoost model achieved outstanding performance with the accuracy of 0.9841 and the area under the receiver operating characteristic curve of 0.9428, which were better than the other models. Using the XGBoost model, the prediction of antihypertensive peptides derived from milk protein was consistent with the peptide-protein docking results, and was more efficient. Our results indicate that using XGBoost algorithm as a novel auxiliary tool is feasible for screening antihypertensive peptide derived from food with high throughput and high efficiency.

Download Full-text