Development and external validation of machine learning algorithms for postnatal gestational age estimation using clinical data and metabolomic markers

Accurate estimates of gestational age at birth are important for preterm birth surveillance but can be challenging to acquire reliably in low- and middle-income countries. Our objective was to develop machine learning models to accurately estimate gestational age shortly after birth using clinical and metabolic data. We derived and internally validated three models using elastic net multivariable linear regression in heel prick blood samples and clinical data from a retrospective cohort of newborns from Ontario, Canada. We conducted external model validation in heel prick and cord blood sample data collected from prospective birth cohorts in Lusaka, Zambia (N=311) and Matlab, Bangladesh (N=1176). The best-performing model accurately estimated gestational age within about 6 days of early pregnancy ultrasound estimates in both cohorts when applied to heel prick data (MAE (95% CI) = 0.79 weeks (0.69, 0.90) for Zambia; 0.81 weeks (0.75, 0.86) for Bangladesh), and within about 7 days when applied to cord blood data (1.02 weeks (0.90, 1.15) for Zambia; 0.95 weeks (0.90, 0.99) for Bangladesh). Algorithms developed in Canada provided accurate estimates of gestational age when applied to external cohorts from Zambia and Bangladesh. Model performance was better in heel prick data than in cord blood data.
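The "within about 6 days" and "within about 7 days" figures follow from converting the reported mean absolute error (MAE) from weeks to days. The sketch below is illustrative only: the function names and the three-infant toy data are hypothetical and not taken from the study, and only the four MAE values are copied from the abstract.

```python
def mean_absolute_error(estimated, reference):
    """Average absolute difference between paired estimates (same units)."""
    if not estimated or len(estimated) != len(reference):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(e - r) for e, r in zip(estimated, reference)) / len(estimated)


def weeks_to_days(weeks):
    """Convert a duration in weeks to days."""
    return weeks * 7.0


# Hypothetical toy data (weeks): model estimates vs. ultrasound reference.
toy_model = [38.0, 40.5, 36.0]
toy_ultrasound = [39.0, 40.0, 37.5]
toy_mae = mean_absolute_error(toy_model, toy_ultrasound)

# MAE values in weeks as reported in the abstract above.
reported_mae_weeks = {
    ("Zambia", "heel prick"): 0.79,
    ("Bangladesh", "heel prick"): 0.81,
    ("Zambia", "cord blood"): 1.02,
    ("Bangladesh", "cord blood"): 0.95,
}
reported_mae_days = {k: weeks_to_days(v) for k, v in reported_mae_weeks.items()}

for (cohort, sample), days in reported_mae_days.items():
    print(f"{cohort}, {sample}: {days:.1f} days")
```

Multiplying by 7 puts the heel prick errors a little under 6 days and the cord blood errors at roughly 7 days, consistent with the rounded figures in the abstract.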

External validation of postnatal gestational age estimation using newborn metabolic profiles in Matlab, Bangladesh

eLife ◽

10.7554/elife.42627 ◽

2019 ◽

Vol 8 ◽

Cited By ~ 5

Author(s):

Malia SQ Murphy ◽

Steven Hawken ◽

Wei Cheng ◽

Lindsay A Wilson ◽

Monica Lamoureux ◽

...

Keyword(s):

Cord Blood ◽

Gestational Age ◽

Age Estimation ◽

External Validation ◽

Population Level ◽

Average Deviation ◽

Term Infants ◽

Data Set ◽

Heel Prick ◽

Gestational Age Estimation

This study sought to evaluate the performance of metabolic gestational age estimation models developed in Ontario, Canada in infants born in Bangladesh. Cord and heel prick blood spots were collected in Bangladesh and analyzed at a newborn screening facility in Ottawa, Canada. Algorithm-derived estimates of gestational age and preterm birth were compared to ultrasound-validated estimates. 1036 cord blood and 487 heel prick samples were collected from 1069 unique newborns. The majority of samples (93.2% of heel prick and 89.9% of cord blood) were collected from term infants. When applied to heel prick data, algorithms correctly estimated gestational age to within an average deviation of 1 week overall (root mean square error = 1.07 weeks). Metabolic gestational age estimation provides accurate population-level estimates of gestational age in this data set. Models were effective on data obtained from both heel prick and cord blood, the latter being a more feasible option in low-resource settings.

Systematic literature review of machine learning methods used in the analysis of real-world data for patient-provider decision making

BMC Medical Informatics and Decision Making ◽

10.1186/s12911-021-01403-2 ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Alan Brnabic ◽

Lisa M. Hess

Keyword(s):

Machine Learning ◽

Decision Making ◽

Literature Review ◽

Systematic Literature Review ◽

Real World ◽

Learning Algorithms ◽

External Validation ◽

Machine Learning Algorithms ◽

Learning Methods ◽

Machine Learning Methods

Abstract Background Machine learning is a broad term encompassing a number of methods that allow the investigator to learn from the data. These methods may permit large real-world databases to be more rapidly translated to applications to inform patient-provider decision making. Methods This systematic literature review was conducted to identify published observational research that employed machine learning to inform decision making at the patient-provider level. The search strategy was implemented and studies meeting eligibility criteria were evaluated by two independent reviewers. Relevant data related to study design, statistical methods and strengths and limitations were identified; study quality was assessed using a modified version of the Luo checklist. Results A total of 34 publications from January 2014 to September 2020 were identified and evaluated for this review. There were diverse methods, statistical packages and approaches used across identified studies. The most common methods included decision tree and random forest approaches. Most studies applied internal validation but only two conducted external validation. Most studies utilized one algorithm, and only eight studies applied multiple machine learning algorithms to the data. Seven items on the Luo checklist failed to be met by more than 50% of published studies. Conclusions A wide variety of approaches, algorithms, statistical software, and validation strategies were employed in the application of machine learning methods to inform patient-provider decision making. There is a need to ensure that multiple machine learning approaches are used, that the model selection strategy is clearly defined, and that both internal and external validation are performed, so that decisions for patient care are made with the highest quality evidence. Future work should routinely employ ensemble methods incorporating multiple machine learning algorithms.

Elimination of Irrelevant Features and Heart Disease Recognition by Employing Machine Learning Algorithms using Clinical Data

2020 17th International Computer Conference on Wavelet Active Media Technology and Information Processing (ICCWAMTIP) ◽

10.1109/iccwamtip51612.2020.9317351 ◽

2020 ◽

Author(s):

Huaiyu Wen ◽

Sufang Li ◽

Amin Ul Haq ◽

Jian Ping Li ◽

Rajesh Kumar ◽

...

Keyword(s):

Machine Learning ◽

Heart Disease ◽

Clinical Data ◽

Learning Algorithms ◽

Machine Learning Algorithms

Machine Learning Algorithms to Predict Recurrence within 10 Years after Breast Cancer Surgery: A Prospective Cohort Study

Cancers ◽

10.3390/cancers12123817 ◽

2020 ◽

Vol 12 (12) ◽

pp. 3817

Author(s):

Shi-Jer Lou ◽

Ming-Feng Hou ◽

Hong-Tai Chang ◽

Chong-Chi Chiu ◽

Hao-Hsien Lee ◽

...

Keyword(s):

Breast Cancer ◽

Machine Learning ◽

Learning Algorithms ◽

External Validation ◽

Model Development ◽

Cancer Surgery ◽

Machine Learning Algorithms ◽

Breast Cancer Surgery ◽

Training Dataset

No studies have discussed machine learning algorithms to predict recurrence within 10 years after breast cancer surgery. This study aimed to compare the accuracy of forecasting models to predict recurrence within 10 years after breast cancer surgery and to identify significant predictors of recurrence. Registry data for breast cancer surgery patients were allocated to a training dataset (n = 798) for model development, a testing dataset (n = 171) for internal validation, and a validating dataset (n = 171) for external validation. Global sensitivity analysis was then performed to evaluate the significance of the selected predictors. Demographic characteristics, clinical characteristics, quality of care, and preoperative quality of life were significantly associated with recurrence within 10 years after breast cancer surgery (p < 0.05). Artificial neural networks had the highest prediction performance indices. Additionally, surgeon volume was the best predictor of recurrence within 10 years after breast cancer surgery, followed by hospital volume and tumor stage. Accurate prediction of recurrence within 10 years by machine learning algorithms may improve precision in managing patients after breast cancer surgery and improve understanding of the risk factors for recurrence.

Predictive model for acute respiratory distress syndrome events in ICU patients in China using machine learning algorithms: a secondary analysis of a cohort study

Journal of Translational Medicine ◽

10.1186/s12967-019-2075-0 ◽

2019 ◽

Vol 17 (1) ◽

Cited By ~ 4

Author(s):

Xian-Fei Ding ◽

Jin-Bo Li ◽

Huo-Yan Liang ◽

Zong-Yu Wang ◽

Ting-Ting Jiao ◽

...

Keyword(s):

Machine Learning ◽

Acute Respiratory Distress Syndrome ◽

Cohort Study ◽

Respiratory Distress Syndrome ◽

Respiratory Distress ◽

Distress Syndrome ◽

External Validation ◽

Secondary Analysis ◽

Machine Learning Algorithms ◽

Chinese Patients

Abstract Background To develop a machine learning model for predicting acute respiratory distress syndrome (ARDS) events through commonly available parameters, including baseline characteristics and clinical and laboratory parameters. Methods A secondary analysis of a multi-centre prospective observational cohort study from five hospitals in Beijing, China, was conducted from January 1, 2011, to August 31, 2014. A total of 296 patients at risk for developing ARDS admitted to medical intensive care units (ICUs) were included. We applied a random forest approach to identify the best set of predictors out of 42 variables measured on day 1 of admission. Results All patients were randomly divided into training (80%) and testing (20%) sets. Additionally, these patients were followed daily and assessed according to the Berlin definition. The model obtained an average area under the receiver operating characteristic (ROC) curve (AUC) of 0.82 and yielded a predictive accuracy of 83%. For the first time, four new biomarkers were included in the model: decreased minimum haematocrit, glucose, and sodium and increased minimum white blood cell (WBC) count. Conclusions This newly established machine learning-based model shows good predictive ability in Chinese patients with ARDS. External validation studies are necessary to confirm the generalisability of our approach across populations and treatment practices.

REDIAL-2020: A Suite of Machine Learning Models to Estimate Anti-SARS-CoV-2 Activities

10.26434/chemrxiv.12915779.v2 ◽

2020 ◽

Author(s):

Govinda KC ◽

Giovanni Bocci ◽

Srijan Verma ◽

Mahmudulla Hassan ◽

Jayme Holmes ◽

...

Keyword(s):

Machine Learning ◽

High Throughput Screening ◽

Web Application ◽

External Validation ◽

Model Development ◽

Machine Learning Algorithms ◽

Virus Infectivity ◽

Learning Models ◽

Live Virus ◽

Machine Learning Models

Strategies for drug discovery and repositioning are an urgent need with respect to COVID-19. We developed "REDIAL-2020", a suite of machine learning models for estimating small molecule activity from molecular structure, for a range of SARS-CoV-2 related assays. Each classifier is based on three distinct types of descriptors (fingerprint, physicochemical, and pharmacophore) for parallel model development. These models were trained using high throughput screening data from the NCATS COVID19 portal (https://opendata.ncats.nih.gov/covid19/index.html), with multiple categorical machine learning algorithms. The “best models” are combined in an ensemble consensus predictor that outperforms single models where external validation is available. This suite of machine learning models is available through the DrugCentral web portal (http://drugcentral.org/Redial). Acceptable input formats are: drug name, PubChem CID, or SMILES; the output is an estimate of anti-SARS-CoV-2 activities. The web application reports estimated activity across three areas (viral entry, viral replication, and live virus infectivity) spanning six independent models, followed by a similarity search that displays the most similar molecules to the query among experimentally determined data. The ML models have 60% to 74% external predictivity, based on three separate datasets. Complementing the NCATS COVID19 portal, REDIAL-2020 can serve as a rapid online tool for identifying active molecules for COVID-19 treatment. The source code and specific models are available through GitHub (https://github.com/sirimullalab/redial-2020), or via Docker Hub (https://hub.docker.com/r/sirimullalab/redial-2020) for users preferring a containerized version.

Development and External Validation of a Machine Learning Tool to Rule Out COVID-19 Among Adults in the Emergency Department Using Routine Blood Tests: A Large, Multicenter, Real-World Study (Preprint)

10.2196/preprints.24048 ◽

2020 ◽

Author(s):

Timothy B Plante ◽

Aaron M Blau ◽

Adrian N Berg ◽

Aaron S Weinberg ◽

Ik C Jun ◽

...

Keyword(s):

Machine Learning ◽

Emergency Department ◽

Clinical Data ◽

External Validation ◽

Blood Tests ◽

Routine Blood ◽

Machine Learning Model ◽

Laboratory Results ◽

Negative Controls ◽

Rule Out

BACKGROUND Conventional diagnosis of COVID-19 with reverse transcription polymerase chain reaction (RT-PCR) testing (hereafter, PCR) is associated with prolonged time to diagnosis and significant costs to run the test. The SARS-CoV-2 virus might lead to characteristic patterns in the results of widely available, routine blood tests that could be identified with machine learning methodologies. Machine learning modalities integrating findings from these common laboratory test results might accelerate ruling out COVID-19 in emergency department patients. OBJECTIVE We sought to develop (ie, train and internally validate with cross-validation techniques) and externally validate a machine learning model to rule out COVID-19 using only routine blood tests among adults in emergency departments. METHODS Using clinical data from emergency departments (EDs) from 66 US hospitals before the pandemic (before the end of December 2019) or during the pandemic (March-July 2020), we included patients aged ≥20 years in the study time frame. We excluded those with missing laboratory results. Model training used 2183 PCR-confirmed cases from 43 hospitals during the pandemic; negative controls were 10,000 prepandemic patients from the same hospitals. External validation used 23 hospitals with 1020 PCR-confirmed cases and 171,734 prepandemic negative controls. The main outcome was COVID-19 status predicted using same-day routine laboratory results. Model performance was assessed with area under the receiver operating characteristic (AUROC) curve as well as sensitivity, specificity, and negative predictive value (NPV). RESULTS Of 192,779 patients included in the training, external validation, and sensitivity data sets (median age decile 50 [IQR 30-60] years, 40.5% male [78,249/192,779]), AUROC for training and external validation was 0.91 (95% CI 0.90-0.92). Using a risk score cutoff of 1.0 (out of 100) in the external validation data set, the model achieved sensitivity of 95.9% and specificity of 41.7%; with a cutoff of 2.0, sensitivity was 92.6% and specificity was 59.9%. At the cutoff of 2.0, the NPVs at a prevalence of 1%, 10%, and 20% were 99.9%, 98.6%, and 97%, respectively. CONCLUSIONS A machine learning model developed with multicenter clinical data integrating commonly collected ED laboratory data demonstrated high rule-out accuracy for COVID-19 status, and might inform selective use of PCR-based testing.
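The prevalence-dependent NPVs quoted for the cutoff of 2.0 can be reproduced from the reported sensitivity and specificity with the standard Bayes-rule formula for negative predictive value. This is a minimal sketch for checking the arithmetic, not the study's own code; the function name is an assumption.

```python
def negative_predictive_value(sensitivity, specificity, prevalence):
    """NPV = P(no disease | negative test), via Bayes' rule."""
    true_negatives = specificity * (1.0 - prevalence)
    false_negatives = (1.0 - sensitivity) * prevalence
    return true_negatives / (true_negatives + false_negatives)


# Sensitivity and specificity reported at the risk score cutoff of 2.0.
SENSITIVITY = 0.926
SPECIFICITY = 0.599

npv_by_prevalence = {
    prevalence: negative_predictive_value(SENSITIVITY, SPECIFICITY, prevalence)
    for prevalence in (0.01, 0.10, 0.20)
}

for prevalence, npv in npv_by_prevalence.items():
    print(f"prevalence {prevalence:.0%}: NPV {npv:.1%}")
```

The computed values agree with the reported 99.9%, 98.6%, and 97% to within rounding, which illustrates why a rule-out test with modest specificity can still have a high NPV when prevalence is low.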

Automated clinical computational biology: an interpretable machine learning framework to predict disease severity and stratify patients from clinical data

10.31219/osf.io/9xc2j ◽

2018 ◽

Author(s):

Soumya Banerjee

Keyword(s):

Machine Learning ◽

Disease Severity ◽

Clinical Data ◽

Model Building ◽

Learning Experience ◽

Machine Learning Algorithms ◽

Close Collaboration ◽

Learning Framework ◽

Novel Biomarkers ◽

Automated Machine Learning

We outline an automated computational and machine learning framework that predicts disease severity and stratifies patients. We apply our framework to available clinical data. Our algorithm automatically generates insights and predicts disease severity with minimal operator intervention. The computational framework presented here can be used to stratify patients, predict disease severity and propose novel biomarkers for disease. Insights from machine learning algorithms coupled with clinical data may help guide therapy, personalize treatment and help clinicians understand the change in disease over time. Computational techniques like these can be used in translational medicine in close collaboration with clinicians and healthcare providers. Our models are also interpretable, allowing clinicians with minimal machine learning experience to engage in model building. This work is a step towards automated machine learning in the clinic.

Predicting postoperative surgical site infection with administrative data: a random forests algorithm

BMC Medical Research Methodology ◽

10.1186/s12874-021-01369-9 ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Yelena Petrosyan ◽

Kednapa Thavorn ◽

Glenys Smith ◽

Malcolm Maclure ◽

Roanne Preston ◽

...

Keyword(s):

Machine Learning ◽

Administrative Data ◽

Risk Score ◽

Random Forests ◽

Learning Algorithms ◽

External Validation ◽

Machine Learning Algorithms ◽

Improvement Program ◽

Health Administrative Data ◽

Administrative Datasets

Abstract Background Since primary data collection can be time-consuming and expensive, surgical site infections (SSIs) could ideally be monitored using routinely collected administrative data. We derived and internally validated efficient algorithms to identify SSIs within 30 days after surgery with health administrative data, using machine learning algorithms. Methods All patients enrolled in the National Surgical Quality Improvement Program from the Ottawa Hospital were linked to administrative datasets in Ontario, Canada. Machine learning approaches, including a Random Forests algorithm and high-performance logistic regression, were used to derive parsimonious models to predict SSI status. Finally, a risk score methodology was used to transform the final models into a risk score system. The SSI risk models were validated in the validation datasets. Results Of 14,351 patients, 795 (5.5%) had an SSI. First, separate predictive models were built for three distinct administrative datasets. The final model, including hospitalization diagnostic, physician diagnostic and procedure codes, demonstrated excellent discrimination (C statistic, 0.91; 95% CI, 0.90–0.92) and calibration (Hosmer-Lemeshow χ2 statistic, 4.531; p = 0.402). Conclusion We demonstrated that health administrative data can be effectively used to identify SSIs. Machine learning algorithms have shown a high degree of accuracy in predicting postoperative SSIs and can integrate and utilize a large amount of administrative data. External validation of this model is required before it can be routinely used to identify SSIs.
