Neonatal mortality prediction with routinely collected data: a machine learning approach

2021, Vol 21 (1)
Author(s): André F. M. Batista, Carmen S. G. Diniz, Eliana A. Bonilha, Ichiro Kawachi, Alexandre D. P. Chiavegatto Filho

Abstract
Background: Recent decreases in neonatal mortality have been slower than expected for most countries. This study aims to predict the risk of neonatal mortality using only data routinely available from birth records in the largest city of the Americas.
Methods: A probabilistic linkage of every birth record occurring in the municipality of São Paulo, Brazil, between 2012 and 2017 was performed with the death records from 2012 to 2018 (1,202,843 births and 447,687 deaths), and a total of 7282 neonatal deaths were identified (a neonatal mortality rate of 6.46 per 1000 live births). Births from 2012 to 2016 (N = 941,308; 83.44% of the total) were used to train five different machine learning algorithms, while births occurring in 2017 (N = 186,854; 16.56% of the total) were used to test their predictive performance on new, unseen data.
Results: The best performance was obtained by the extreme gradient boosting trees (XGBoost) algorithm, with a very high AUC of 0.97 and an F1-score of 0.55. The 5% of births with the highest predicted risk of neonatal death included more than 90% of the actual neonatal deaths. On the other hand, there were no deaths among the 5% of births with the lowest predicted risk. There were no significant differences in predictive performance for vulnerable subgroups. The use of a smaller number of variables (WHO's five minimum perinatal indicators) decreased overall performance, but the results still remained high (AUC of 0.91). With the addition of only three more variables, we achieved the same predictive performance (AUC of 0.97) as when using all 23 variables originally available from the Brazilian birth records.
Conclusion: Machine learning algorithms were able to identify, with very high predictive performance, the neonatal mortality risk of newborns using only routinely collected data.
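
The temporal split and headline metrics described in the Methods and Results could be reproduced along the following lines. This is a minimal sketch on synthetic stand-in data: the column names, features, and hyperparameters are assumptions for illustration, not taken from the paper.

```python
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.metrics import roc_auc_score, f1_score

# Synthetic stand-in for the linked birth records (column names are assumptions).
rng = np.random.default_rng(0)
n = 20000
births = pd.DataFrame({
    "birth_year": rng.integers(2012, 2018, n),
    "birth_weight": rng.normal(3200, 500, n),
    "gestational_weeks": rng.normal(39, 2, n),
})
# Synthetic outcome loosely tied to low weight / prematurity, just so the sketch runs.
risk_score = (3000 - births["birth_weight"]) / 500 + (37 - births["gestational_weeks"])
births["neonatal_death"] = (risk_score + rng.normal(0, 1, n) > 3).astype(int)

predictors = ["birth_weight", "gestational_weeks"]

# Temporal split mirroring the study design: 2012-2016 for training, 2017 for testing.
train = births[births["birth_year"].between(2012, 2016)]
test = births[births["birth_year"] == 2017]

model = xgb.XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1)
model.fit(train[predictors], train["neonatal_death"])

risk = model.predict_proba(test[predictors])[:, 1]
print("AUC:", roc_auc_score(test["neonatal_death"], risk))
print("F1 :", f1_score(test["neonatal_death"], risk >= 0.5))

# Share of actual deaths captured by the 5% of births with the highest predicted risk.
top5 = risk >= np.quantile(risk, 0.95)
print("Deaths in top-5% risk group:",
      test.loc[top5, "neonatal_death"].sum() / test["neonatal_death"].sum())
```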

2021, Vol 11 (1)
Author(s): Chengmao Zhou, Junhong Hu, Ying Wang, Mu-Huo Ji, Jianhua Tong, ...

Abstract: This study explores the predictive performance of machine learning for the recurrence of gastric cancer in patients after surgery. The available data were divided into two parts: the first part was used as a training set (80% of the original data) and the second part as a test set (the remaining 20%), and fivefold cross-validation was applied. The weighting of recurrence factors shows that the top four factors are, in order, BMI, operation time, WGT, and age. In the training group, among the five machine learning models, the accuracy of gbm was the highest at 0.891, followed by the next-best model at 0.876. The AUC values of the five machine learning algorithms, from high to low, were forest (0.962), gbm (0.922), GradientBoosting (0.898), DecisionTree (0.790), and Logistic (0.748); the precision of forest was the highest (0.957), followed by GradientBoosting (0.878). In the test group, Logistic had the highest accuracy (0.801), followed by the forest and gbm algorithms; the AUC values of the five algorithms, from high to low, were forest (0.795), GradientBoosting (0.774), DecisionTree (0.773), Logistic (0.771), and gbm (0.771). Among the five machine learning algorithms, Logistic had the highest precision (1.000), followed by gbm (0.487). Machine learning can predict the recurrence of gastric cancer after surgery, and the top four factors affecting postoperative recurrence were BMI, operation time, WGT, and age.
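
A minimal sketch of the evaluation protocol described above (an 80/20 train-test split plus fivefold cross-validation over several classifiers). The data are synthetic stand-ins for the clinical variables, and the unspecified "gbm" implementation is omitted; only scikit-learn counterparts of the models named in the abstract are shown.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the clinical features (BMI, operation time, WGT, age, ...).
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "Logistic": LogisticRegression(max_iter=1000),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "forest": RandomForestClassifier(n_estimators=300, random_state=0),
    "GradientBoosting": GradientBoostingClassifier(random_state=0),
}

for name, clf in models.items():
    # Fivefold cross-validation on the training portion, then accuracy on the held-out 20%.
    cv_auc = cross_val_score(clf, X_train, y_train, cv=5, scoring="roc_auc").mean()
    test_acc = clf.fit(X_train, y_train).score(X_test, y_test)
    print(f"{name}: CV AUC={cv_auc:.3f}, test accuracy={test_acc:.3f}")
```

The ranking of recurrence factors reported above would then come from the fitted models' feature importances (e.g. `RandomForestClassifier.feature_importances_`).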


Author(s): Michael McCartney, Matthias Haeringer, Wolfgang Polifke

Abstract: This paper examines and compares commonly used machine learning algorithms on their performance in the interpolation and extrapolation of flame describing functions (FDFs), based on experimental and simulation data. Algorithm performance is evaluated by interpolating and extrapolating FDFs, and the impact of the resulting errors on limit cycle amplitudes is then evaluated using the xFDF framework. The best algorithms in interpolation and extrapolation were found to be the widely used cubic spline interpolation and the Gaussian Process regressor. The data itself was found to be an important factor in defining the predictive performance of a model; therefore, a method of optimally selecting data points at test time using Gaussian Processes was demonstrated. The aim is to allow a minimal number of data points to be collected while still providing enough information to model the FDF accurately. The extrapolation performance was shown to decay very quickly with distance from the domain, so emphasis should be put on selecting measurement points that expand the covered domain. Gaussian Processes also provide an indication of confidence in their predictions and were used to carry out uncertainty quantification in order to understand model sensitivities. This was demonstrated through application to the xFDF framework.
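
As a rough illustration of the two best-performing approaches named above, the following sketch fits a cubic spline and a Gaussian Process to a synthetic one-dimensional curve and queries the GP's predictive standard deviation, which grows quickly outside the measured range. The data and kernel are placeholders, not the actual FDF setup.

```python
import numpy as np
from scipy.interpolate import CubicSpline
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Synthetic stand-in for an FDF gain measured at a few forcing amplitudes (assumed data).
x = np.linspace(0.0, 1.0, 8)
y = np.exp(-2.0 * x) * np.cos(6.0 * x)

# Cubic spline interpolation through the measured points.
spline = CubicSpline(x, y)

# Gaussian Process regression; the predictive standard deviation quantifies confidence
# and can drive the selection of the next measurement point.
kernel = RBF(length_scale=0.2) + WhiteKernel(noise_level=1e-4)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(x.reshape(-1, 1), y)

x_query = np.linspace(0.0, 1.4, 50)            # extends 40% beyond the data
y_spline = spline(x_query)                     # spline extrapolates its boundary polynomial
y_gp, y_std = gp.predict(x_query.reshape(-1, 1), return_std=True)

print("max GP std inside the measured range :", y_std[x_query <= 1.0].max().round(4))
print("max GP std outside the measured range:", y_std[x_query > 1.0].max().round(4))
```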


Mathematics, 2021, Vol 9 (20), pp. 2537
Author(s): Luis Rolando Guarneros-Nolasco, Nancy Aracely Cruz-Ramos, Giner Alor-Hernández, Lisbeth Rodríguez-Mazahua, José Luis Sánchez-Cervantes

Cardiovascular Diseases (CVDs) are a leading cause of death globally. In CVDs, the heart is unable to deliver enough blood to other body regions. As an effective and accurate diagnosis of CVDs is essential for CVD prevention and treatment, machine learning (ML) techniques can be effectively and reliably used to discern patients suffering from a CVD from those who do not suffer from any heart condition. Namely, machine learning algorithms (MLAs) play a key role in the diagnosis of CVDs through predictive models that allow us to identify the main risk factors influencing CVD development. In this study, we analyze the performance of ten MLAs on two datasets for CVD prediction and two for CVD diagnosis. Algorithm performance is analyzed on the top-two and top-four dataset attributes/features with respect to five performance metrics (accuracy, precision, recall, f1-score, and roc-auc) using the train-test split technique and k-fold cross-validation. Our study identifies the top-two and top-four attributes from the CVD datasets by analyzing the accuracy metric to determine which attributes are best for predicting and diagnosing CVD. As our main findings, the ten ML classifiers exhibited appropriate classification and predictive performance on the accuracy metric with the top-two attributes, identifying three main attributes for the diagnosis and prediction of CVDs such as arrhythmia and tachycardia; hence, they can be successfully implemented to improve current CVD diagnosis efforts and help patients around the world, especially in regions where medical staff is scarce.
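
The attribute-ranking step implied above can be sketched as follows: rank features with a simple univariate score, keep the top two, and compare classifiers under k-fold cross-validation on the five metrics. The dataset is a stand-in tabular medical set, and the ANOVA F-score ranking is an assumption, not the selection method used in the study.

```python
from sklearn.datasets import load_breast_cancer  # stand-in tabular medical dataset
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# Keep only the top-2 attributes, ranked here by ANOVA F-score.
X_top2 = SelectKBest(f_classif, k=2).fit_transform(X, y)

metrics = ["accuracy", "precision", "recall", "f1", "roc_auc"]
classifiers = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(n_estimators=200, random_state=0),
}

for name, clf in classifiers.items():
    # Fivefold cross-validation reporting all five metrics on the reduced feature set.
    scores = cross_validate(clf, X_top2, y, cv=5, scoring=metrics)
    summary = {m: round(scores[f"test_{m}"].mean(), 3) for m in metrics}
    print(name, summary)
```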


2020, Vol 12 (23), pp. 3926
Author(s): Martina Deur, Mateo Gašparović, Ivan Balenović

Spatially explicit information on tree species composition is important for both the forest management and conservation sectors. In combination with machine learning algorithms, very high-resolution satellite imagery may provide an effective solution to reduce the need for labor-intensive and time-consuming field-based surveys. In this study, we evaluated the possibility of using multispectral WorldView-3 (WV-3) satellite imagery for the classification of three main tree species (Quercus robur L., Carpinus betulus L., and Alnus glutinosa (L.) Gaertn.) in a lowland, mixed deciduous forest in central Croatia. The pixel-based supervised classification was performed using two machine learning algorithms: random forest (RF) and support vector machine (SVM). Additionally, the contribution of gray-level co-occurrence matrix (GLCM) texture features from WV-3 imagery to tree species classification was evaluated. Principal component analysis confirmed GLCM variance to be the most significant texture feature. Of the 373 visually interpreted reference polygons, 237 were used as training polygons and 136 as validation polygons. The validation results show relatively high overall accuracy (85%) for tree species classification based solely on WV-3 spectral characteristics and the RF classification approach. As expected, an improvement in classification accuracy was achieved by combining spectral and textural features. With the additional use of GLCM variance, the overall accuracy improved by 10% and 7% for the RF and SVM classification approaches, respectively.
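
A minimal sketch of the pixel-based workflow with one GLCM-variance texture band, computed here on a synthetic single-band image with random labels; the window size, quantization level, and classifier settings are placeholders rather than the study's configuration.

```python
import numpy as np
from skimage.feature import graycomatrix
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
band = rng.integers(0, 32, size=(64, 64)).astype(np.uint8)   # stand-in spectral band, quantized to 32 levels
labels = rng.integers(0, 3, size=(64, 64))                   # stand-in per-pixel species labels

def glcm_variance(window, levels=32):
    # GLCM variance for one window: sum_i P(i) * (i - mean)^2 over the normalized marginal.
    glcm = graycomatrix(window, distances=[1], angles=[0], levels=levels,
                        symmetric=True, normed=True)[:, :, 0, 0]
    p_i = glcm.sum(axis=1)
    i = np.arange(levels)
    mu = (i * p_i).sum()
    return ((i - mu) ** 2 * p_i).sum()

# Texture band from a sliding 5x5 window (image edges skipped for brevity).
texture = np.zeros_like(band, dtype=float)
for r in range(2, 62):
    for c in range(2, 62):
        texture[r, c] = glcm_variance(band[r - 2:r + 3, c - 2:c + 3])

# Pixel-based supervised classification on spectral + texture features.
X = np.column_stack([band.ravel(), texture.ravel()])
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels.ravel())
print("training accuracy:", clf.score(X, labels.ravel()))
```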


2020, Vol 1 (2), pp. 1-4
Author(s): Priyam Guha, Abhishek Mukherjee, Abhishek Verma

This research paper deals with using supervised machine learning algorithms to detect the authenticity of banknotes. In this research we were successful in achieving very high accuracy (of the order of 99%) by applying some data preprocessing steps and then running the processed data through supervised learning algorithms such as SVM, Decision Trees, Logistic Regression, and KNN. We then proceed to analyze the misclassified points. We examine the confusion matrix to find out which algorithms had more false positives and which had more false negatives.
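
The workflow above maps to a short scikit-learn sketch: scale the features, fit the four classifiers, and read false positives and false negatives off each confusion matrix. Synthetic data stand in for the four wavelet-based banknote features; nothing below is taken from the paper's actual pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

# Stand-in for the four banknote features and the genuine/forged label.
X, y = make_classification(n_samples=1372, n_features=4, n_informative=4,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "LogisticRegression": make_pipeline(StandardScaler(), LogisticRegression()),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
}

for name, clf in models.items():
    clf.fit(X_train, y_train)
    # Confusion matrix layout for binary labels: [[TN, FP], [FN, TP]].
    tn, fp, fn, tp = confusion_matrix(y_test, clf.predict(X_test)).ravel()
    print(f"{name}: false positives={fp}, false negatives={fn}")
```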


Author(s): Luis Rolando Guarneros-Nolasco, Nancy Aracely Cruz-Ramos, Giner Alor-Hernández, Lisbeth Rodríguez-Mazahua, José Luis Sánchez-Cervantes

CVDs are a leading cause of death globally. In CVDs, the heart is unable to deliver enough blood to other body regions. Since effective and accurate diagnosis of CVDs is essential for CVD prevention and treatment, machine learning (ML) techniques can be effectively and reliably used to discern patients suffering from a CVD from those who do not suffer from any heart condition. Namely, machine learning algorithms (MLAs) play a key role in the diagnosis of CVDs through predictive models that allow us to identify the main risk factors influencing CVD development. In this study, we analyze the performance of ten MLAs on two datasets for CVD prediction and two for CVD diagnosis. Algorithm performance is analyzed on the top-two and top-four dataset attributes/features with respect to five performance metrics (accuracy, precision, recall, f1-score, and roc-auc) using the train-test split technique and k-fold cross-validation. Our study identifies the top two and top four attributes from each CVD diagnosis/prediction dataset. As our main findings, the ten MLAs exhibited appropriate diagnostic and predictive performance; hence, they can be successfully implemented to improve current CVD diagnosis efforts and help patients around the world, especially in regions where medical staff is scarce.
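
Complementing the cross-validation sketch given for the earlier version of this abstract, the train-test split variant with the five named metrics can be sketched as follows; the dataset is again a stand-in, not one of the four CVD datasets analyzed.

```python
from sklearn.datasets import load_breast_cancer  # stand-in tabular medical dataset
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

clf = GradientBoostingClassifier(random_state=1).fit(X_train, y_train)
pred = clf.predict(X_test)
prob = clf.predict_proba(X_test)[:, 1]

# The five metrics named in the abstract, computed on the held-out split.
print("accuracy :", accuracy_score(y_test, pred))
print("precision:", precision_score(y_test, pred))
print("recall   :", recall_score(y_test, pred))
print("f1-score :", f1_score(y_test, pred))
print("roc-auc  :", roc_auc_score(y_test, prob))
```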


2021, Vol 99 (Supplement_3), pp. 15-16
Author(s): Pablo A S Fonseca, Massimo Tornatore, Angela Cánovas

Abstract: Reduced fertility is one of the main causes of economic losses in dairy farms. The cost of a stillbirth is estimated at US$938 per case in Holstein herds. Machine learning (ML) is gaining popularity in the livestock sector as a means to identify hidden patterns and because of its potential to address dimensionality problems. Here we investigate the application of ML algorithms for the prediction of cows with higher stillbirth susceptibility in two scenarios: cows with >25% and >33.33% of stillbirths among their birth records. These thresholds correspond to the 75th (still_75) and 90th (still_90) percentiles, respectively. A total of 10,570 cows and 50,541 birth records were collected to perform a haplotype-based genome-wide association study. Five hundred significant pseudo single nucleotide polymorphisms (pseudo-SNPs) (false-discovery rate < 0.05) were used as input features of ML-based predictions to determine whether a cow is in the top-75 or top-90 percentile. Table 1 shows the classification performance of the investigated ML and linear models. The ML models outperformed the linear models for both thresholds. In general, still_75 showed higher F1 values than still_90, suggesting a lower misclassification ratio when a less stringent threshold is used. The accuracy of the models in our study is higher than ML-based prediction accuracies reported for other breeds, e.g. the accuracies of 0.46 and 0.67 achieved using SNPs for body weight in Brahman and fertility traits in Nellore, respectively. The XGBoost algorithm showed the highest balanced accuracy (BA; 0.625), F1-score (0.588), and area under the curve (AUC; 0.688), suggesting that XGBoost can achieve the highest predictive performance and the lowest difference in misclassification ratio between classes. ML applied over haplotype libraries is an interesting approach for detecting animals with higher susceptibility to stillbirths, owing to its high predictive accuracy and relatively low misclassification ratio.
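
A hedged sketch of the classification step: a gradient-boosted model on a binary susceptibility label with SNP-like 0/1/2 features, scored with the three metrics reported for XGBoost above. The genotype matrix and labels below are random stand-ins, and the split and hyperparameters are assumptions, not those of the study.

```python
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score, f1_score, roc_auc_score

rng = np.random.default_rng(42)
# Stand-in genotype matrix: 10,570 cows x 500 significant pseudo-SNPs coded 0/1/2,
# and a binary label for being above the 75th percentile of stillbirth proportion.
X = rng.integers(0, 3, size=(10570, 500))
y = rng.integers(0, 2, size=10570)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = XGBClassifier(n_estimators=100, max_depth=3, learning_rate=0.1)
model.fit(X_train, y_train)

pred = model.predict(X_test)
prob = model.predict_proba(X_test)[:, 1]
print("balanced accuracy:", balanced_accuracy_score(y_test, pred))
print("F1-score         :", f1_score(y_test, pred))
print("AUC              :", roc_auc_score(y_test, prob))
```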


2021
Author(s): Nuno Moniz, Susana Barbosa

The Dansgaard-Oeschger (DO) events are among the most striking examples of abrupt climate change in the Earth's history, representing temperature oscillations of about 8 to 16 degrees Celsius within a few decades. DO events have been studied extensively in paleoclimatic records, particularly in ice core proxies; examples include the Greenland NGRIP record of oxygen isotopic composition.

This work addresses the anticipation of DO events using machine learning algorithms. We consider the NGRIP time series from 20 to 60 kyr b2k with the GICC05 timescale and 20-year temporal resolution. Forecasting horizons range from 0 (nowcasting) to 400 years. We adopt three different machine learning algorithms (random forests, support vector machines, and logistic regression) in training windows of 5 kyr. We perform validation on subsequent test windows of 5 kyr, based on the timestamps of DO events previously classified in Greenland by Rasmussen et al. (2014). We perform experiments with both sliding and growing windows.

Results show that predictions on sliding windows are better overall, indicating that modelling is affected by non-stationary characteristics of the time series. The three algorithms' predictive performance is similar, with slightly better performance of random forest models for shorter forecast horizons. The models' predictive capability decreases as the forecasting horizon grows but remains reasonable up to 120 years. Performance degradation is mostly related to imprecision in determining the start and end times of events and to identifying some periods as DO events when they are not.
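
The sliding- versus growing-window comparison can be sketched on a synthetic binary event series at 20-year resolution. The 5 kyr train/test windows follow the abstract, but the features and labels below are placeholders rather than the NGRIP record, and only the nowcasting case (no forecast horizon) is shown.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 2000                          # 2000 samples at 20-yr resolution = 40 kyr of record
X = rng.normal(size=(n, 3))       # stand-in proxy features (e.g. lagged isotope statistics)
y = (X[:, 0] + rng.normal(scale=0.5, size=n) > 1.5).astype(int)  # stand-in DO-event labels

win = 250                         # 250 samples x 20 yr = 5 kyr windows
for mode in ("sliding", "growing"):
    aucs = []
    for start in range(win, n - win, win):
        lo = start - win if mode == "sliding" else 0   # sliding keeps 5 kyr, growing keeps all history
        clf = RandomForestClassifier(n_estimators=200, random_state=0)
        clf.fit(X[lo:start], y[lo:start])
        prob = clf.predict_proba(X[start:start + win])[:, 1]
        aucs.append(roc_auc_score(y[start:start + win], prob))
    print(mode, "mean AUC:", np.round(np.mean(aucs), 3))
```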

