Assessment of machine-learning techniques on large pathology data sets to address assay redundancy in routine liver function test profiles

AbstractRoutine liver function tests (LFTs) are central to serum testing profiles, particularly in community medicine. However there is concern about the redundancy of information provided to requesting clinicians. Large quantities of clinical laboratory data and advances in computational knowledge discovery methods provide opportunities to re-examine the value of individual routine laboratory results that combine for LFT profiles.The machine learning methods recursive partitioning (decision trees) and support vector machines (SVMs) were applied to aggregate clinical chemistry data that included elevated LFT profiles. Response categories for γ-glutamyl transferase (GGT) were established based on whether the patient results were within or above the sex-specific reference interval. Single decision tree and SVMs were applied to test the accuracy of GGT prediction by the highest ranked predictors of GGT response, alkaline phosphatase (ALP) and alanine amino-transaminase (ALT).Through interrogating more than 20,000 individual cases comprising both sexes and all ages, decision trees predicted GGT category at 90% accuracy using only ALP and ALT, with a SVM prediction accuracy of 82.6% after 10-fold training and testing. Bilirubin, lactate dehydrogenase (LD) and albumin did not enhance prediction, or reduced accuracy. Comparison of abnormal (elevated) GGT categories also supported the primacy of ALP and ALT as screening markers, with serum urate and cholesterol also useful.Machine-learning interrogation of massive clinical chemistry data sets demonstrated a strategy to address redundancy in routine LFT screening by identifying ALT and ALP in tandem as able to accurately predict GGT elevation, suggesting that GGT can be removed from routine LFT screening.

Download Full-text

A One-shot Learning Approach to Image Classification using Genetic Programming

10.26686/wgtn.13150934.v1 ◽

2020 ◽

Author(s):

Harith Al-Sahaf ◽

Mengjie Zhang ◽

M Johnston

Keyword(s):

Machine Learning ◽

Support Vector Machine ◽

Genetic Programming ◽

Image Classification ◽

Local Binary Patterns ◽

Support Vector ◽

Learning Approach ◽

Data Sets ◽

Domain Specific ◽

International Publishing

In machine learning, it is common to require a large number of instances to train a model for classification. In many cases, it is hard or expensive to acquire a large number of instances. In this paper, we propose a novel genetic programming (GP) based method to the problem of automatic image classification via adopting a one-shot learning approach. The proposed method relies on the combination of GP and Local Binary Patterns (LBP) techniques to detect a predefined number of informative regions that aim at maximising the between-class scatter and minimising the within-class scatter. Moreover, the proposed method uses only two instances of each class to evolve a classifier. To test the effectiveness of the proposed method, four different texture data sets are used and the performance is compared against two other GP-based methods namely Conventional GP and Two-tier GP. The experiments revealed that the proposed method outperforms these two methods on all the data sets. Moreover, a better performance has been achieved by Naïve Bayes, Support Vector Machine, and Decision Trees (J48) methods when extracted features by the proposed method have been used compared to the use of domain-specific and Two-tier GP extracted features. © Springer International Publishing 2013.

Download Full-text

Developing a Process for the Analysis of User Journeys and the Prediction of Dropout in Digital Health Interventions: Machine Learning Approach (Preprint)

10.2196/preprints.17738 ◽

2020 ◽

Author(s):

Vincent Bremer ◽

Philip I Chow ◽

Burkhardt Funk ◽

Frances P Thorndike ◽

Lee M Ritterband

Keyword(s):

Machine Learning ◽

Decision Trees ◽

Behavioral Therapy ◽

Digital Health ◽

Area Under The Curve ◽

Prediction Performance ◽

Health Interventions ◽

Drop Out ◽

Support Vector ◽

Boosted Decision Trees

BACKGROUND User dropout is a widespread concern in the delivery and evaluation of digital (ie, web and mobile apps) health interventions. Researchers have yet to fully realize the potential of the large amount of data generated by these technology-based programs. Of particular interest is the ability to predict who will drop out of an intervention. This may be possible through the analysis of user journey data—self-reported as well as system-generated data—produced by the path (or journey) an individual takes to navigate through a digital health intervention. OBJECTIVE The purpose of this study is to provide a step-by-step process for the analysis of user journey data and eventually to predict dropout in the context of digital health interventions. The process is applied to data from an internet-based intervention for insomnia as a way to illustrate its use. The completion of the program is contingent upon completing 7 sequential cores, which include an initial tutorial core. Dropout is defined as not completing the seventh core. METHODS Steps of user journey analysis, including data transformation, feature engineering, and statistical model analysis and evaluation, are presented. Dropouts were predicted based on data from 151 participants from a fully automated web-based program (Sleep Healthy Using the Internet) that delivers cognitive behavioral therapy for insomnia. Logistic regression with L1 and L2 regularization, support vector machines, and boosted decision trees were used and evaluated based on their predictive performance. Relevant features from the data are reported that predict user dropout. RESULTS Accuracy of predicting dropout (area under the curve [AUC] values) varied depending on the program core and the machine learning technique. After model evaluation, boosted decision trees achieved AUC values ranging between 0.6 and 0.9. Additional handcrafted features, including time to complete certain steps of the intervention, time to get out of bed, and days since the last interaction with the system, contributed to the prediction performance. CONCLUSIONS The results support the feasibility and potential of analyzing user journey data to predict dropout. Theory-driven handcrafted features increased the prediction performance. The ability to predict dropout at an individual level could be used to enhance decision making for researchers and clinicians as well as inform dynamic intervention regimens.

Download Full-text

Identification of DNA-binding proteins via Hypergraph based Laplacian Support Vector Machine

Current Bioinformatics ◽

10.2174/1574893616666210806091922 ◽

2021 ◽

Vol 16 ◽

Author(s):

Yuqing Qian ◽

Hao Meng ◽

Weizhong Lu ◽

Zhijun Liao ◽

Yijie Ding ◽

...

Keyword(s):

Machine Learning ◽

Dna Binding ◽

Large Scale ◽

Binding Proteins ◽

Predictive Accuracy ◽

Dna Binding Proteins ◽

Research Field ◽

Support Vector ◽

Data Sets ◽

Independent Test

Background: The identification of DNA binding proteins (DBP) is an important research field. Experiment-based methods are time-consuming and labor-intensive for detecting DBP. Objective: To solve the problem of large-scale DBP identification, some machine learning methods are proposed. However, these methods have insufficient predictive accuracy. Our aim is to develop a sequence-based machine learning model to predict DBP. Methods: In our study, we extract six types of features (including NMBAC, GE, MCD, PSSM-AB, PSSM-DWT, and PsePSSM) from protein sequences. We use Multiple Kernel Learning based on Hilbert-Schmidt Independence Criterion (MKL-HSIC) to estimate the optimal kernel. Then, we construct a hypergraph model to describe the relationship between labeled and unlabeled samples. Finally, Laplacian Support Vector Machines (LapSVM) is employed to train the predictive model. Our method is tested on PDB186, PDB1075, PDB2272 and PDB14189 data sets. Result: Compared with other methods, our model achieves best results on benchmark data sets. Conclusion: The accuracy of 87.1% and 74.2% are achieved on PDB186 (Independent test of PDB1075) and PDB2272 (Independent test of PDB14189), respectively.

Download Full-text

Parallel Tuning of Support Vector Machine Learning Parameters for Large and Unbalanced Data Sets

Lecture Notes in Computer Science - Computational Life Sciences ◽

10.1007/11560500_23 ◽

2005 ◽

pp. 253-264 ◽

Cited By ~ 5

Author(s):

Tatjana Eitrich ◽

Bruno Lang

Keyword(s):

Machine Learning ◽

Support Vector Machine ◽

Support Vector ◽

Unbalanced Data ◽

Data Sets

Download Full-text

Clinical chemistry in higher dimensions: Machine-learning and enhanced prediction from routine clinical chemistry data

Clinical Biochemistry ◽

10.1016/j.clinbiochem.2016.07.013 ◽

2016 ◽

Vol 49 (16-17) ◽

pp. 1213-1220 ◽

Cited By ~ 10

Author(s):

Alice Richardson ◽

Ben M. Signor ◽

Brett A. Lidbury ◽

Tony Badrick

Keyword(s):

Machine Learning ◽

Clinical Chemistry ◽

Higher Dimensions ◽

Chemistry Data

Download Full-text

Artificial Intelligence in the Intensive Care Unit

Seminars in Respiratory and Critical Care Medicine ◽

10.1055/s-0040-1719037 ◽

2020 ◽

Author(s):

Massimiliano Greco ◽

Pier F. Caruso ◽

Maurizio Cecconi

Keyword(s):

Artificial Intelligence ◽

Machine Learning ◽

Intensive Care ◽

Language Processing ◽

Laboratory Data ◽

Kidney Injury ◽

Support Vector ◽

Learning Models ◽

Electronic Health ◽

Natural Terrain

AbstractThe diffusion of electronic health records collecting large amount of clinical, monitoring, and laboratory data produced by intensive care units (ICUs) is the natural terrain for the application of artificial intelligence (AI). AI has a broad definition, encompassing computer vision, natural language processing, and machine learning, with the latter being more commonly employed in the ICUs. Machine learning may be divided in supervised learning models (i.e., support vector machine [SVM] and random forest), unsupervised models (i.e., neural networks [NN]), and reinforcement learning. Supervised models require labeled data that is data mapped by human judgment against predefined categories. Unsupervised models, on the contrary, can be used to obtain reliable predictions even without labeled data. Machine learning models have been used in ICU to predict pathologies such as acute kidney injury, detect symptoms, including delirium, and propose therapeutic actions (vasopressors and fluids in sepsis). In the future, AI will be increasingly used in ICU, due to the increasing quality and quantity of available data. Accordingly, the ICU team will benefit from models with high accuracy that will be used for both research purposes and clinical practice. These models will be also the foundation of future decision support system (DSS), which will help the ICU team to visualize and analyze huge amounts of information. We plea for the creation of a standardization of a core group of data between different electronic health record systems, using a common dictionary for data labeling, which could greatly simplify sharing and merging of data from different centers.

Download Full-text

Predicting Fault Slip via Transfer Learning

10.21203/rs.3.rs-700852/v1 ◽

2021 ◽

Author(s):

Kun Wang ◽

Christopher Johnson ◽

Kane Bennett ◽

Paul Johnson

Keyword(s):

Machine Learning ◽

Numerical Simulations ◽

Transfer Learning ◽

Laboratory Experiments ◽

Laboratory Data ◽

Fault Slip ◽

Geophysical Data ◽

Training Data ◽

Data Sets ◽

Earthquake Cycle

Abstract Data-driven machine-learning for predicting instantaneous and future fault-slip in laboratory experiments has recently progressed markedly due to large training data sets. In Earth however, earthquake interevent times range from 10's-100's of years and geophysical data typically exist for only a portion of an earthquake cycle. Sparse data presents a serious challenge to training machine learning models. Here we describe a transfer learning approach using numerical simulations to train a convolutional encoder-decoder that predicts fault-slip behavior in laboratory experiments. The model learns a mapping between acoustic emission histories and fault-slip from numerical simulations, and generalizes to produce accurate results using laboratory data. Notably slip-predictions markedly improve using the simulation-data trained-model and training the latent space using a portion of a single laboratory earthquake-cycle. The transfer learning results elucidate the potential of using models trained on numerical simulations and fine-tuned with small geophysical data sets for potential applications to faults in Earth.

Download Full-text

Preoperative prediction of complicated appendicitis using machine learning method

10.21203/rs.3.rs-27341/v1 ◽

2020 ◽

Author(s):

Chunbo Kang ◽

Xubin Li ◽

Xiaoqian Chi ◽

Yabin Yang ◽

Haifeng Shan ◽

...

Keyword(s):

Machine Learning ◽

Clinical Symptoms ◽

Laboratory Data ◽

Learning Model ◽

Complicated Appendicitis ◽

Support Vector ◽

Training Set ◽

Preoperative Prediction ◽

Machine Learning Model ◽

Testing Set

Abstract BACKGROUND Accurate preoperative prediction of complicated appendicitis (CA) could help selecting optimal treatment and reducing risks of postoperative complications. The study aimed to develop a machine learning model based on clinical symptoms and laboratory data for preoperatively predicting CA.METHODS 136 patients with clinicopathological diagnosis of acute appendicitis were retrospectively included in the study. The dataset was randomly divided (94: 42) into training and testing set. Predictive models using individual and combined selected clinical and laboratory data features were built separately. Three combined models were constructed using logistic regression (LR), support vector machine (SVM) and random forest (RF) algorithms. The CA prediction performance was evaluated with Receiver Operating Characteristic (ROC) analysis, using the area under the curve (AUC), sensitivity, specificity and accuracy factors.RESULTS The features of the abdominal pain time, nausea and vomiting, the highest temperature, high sensitivity-CRP (hs-CRP) and procalcitonin (PCT) had significant differences in the CA prediction (P<0.001). The ability to predict CA by individual feature was low (AUC<0.8). The prediction by combined features was significantly improved. The AUC of the three models (LR, SVM and RF) in the training set and the testing set were 0.805, 0.888, 0.908 and 0.794, 0.895, 0.761, respectively. The SVM-based model showed a better performance for CA prediction. RF had a higher AUC in the training set, but its poor efficiency in the testing set indicated a poor generalization ability.CONCLUSIONS The SVM machine learning model applying clinical and laboratory data can well predict CA preoperatively which could assist diagnosis in resource limited settings.

Download Full-text

Using Machine Learning to Aid the Interpretation of Urine Steroid Profiles

Clinical Chemistry ◽

10.1373/clinchem.2018.292201 ◽

2018 ◽

Vol 64 (11) ◽

pp. 1586-1595 ◽

Cited By ~ 10

Author(s):

Edmund H Wilkes ◽

Gill Rumsby ◽

Gary M Woodward

Keyword(s):

Machine Learning ◽

Clinical Laboratory ◽

Laboratory Data ◽

Large Data ◽

Data Sets ◽

Clinical Practices ◽

Steroid Profiles ◽

The Individual ◽

Multiclass Classifier ◽

Urine Steroid

Abstract BACKGROUND Urine steroid profiles are used in clinical practice for the diagnosis and monitoring of disorders of steroidogenesis and adrenal pathologies. Machine learning (ML) algorithms are powerful computational tools used extensively for the recognition of patterns in large data sets. Here, we investigated the utility of various ML algorithms for the automated biochemical interpretation of urine steroid profiles to support current clinical practices. METHODS Data from 4619 urine steroid profiles processed between June 2012 and October 2016 were retrospectively collected. Of these, 1314 profiles were used to train and test various ML classifiers' abilities to differentiate between “No significant abnormality” and “?Abnormal” profiles. Further classifiers were trained and tested for their ability to predict the specific biochemical interpretation of the profiles. RESULTS The best performing binary classifier could predict the interpretation of No significant abnormality and ?Abnormal profiles with a mean area under the ROC curve of 0.955 (95% CI, 0.949–0.961). In addition, the best performing multiclass classifier could predict the individual abnormal profile interpretation with a mean balanced accuracy of 0.873 (0.865–0.880). CONCLUSIONS Here we have described the application of ML algorithms to the automated interpretation of urine steroid profiles. This provides a proof-of-concept application of ML algorithms to complex clinical laboratory data that has the potential to improve laboratory efficiency in a setting of limited staff resources.

Download Full-text

Comparison of the Validity and Generalizability of Machine Learning Algorithms for the Prediction of Energy Expenditure: Validation Study

JMIR mhealth and uhealth ◽

10.2196/23938 ◽

2021 ◽

Vol 9 (8) ◽

pp. e23938

Author(s):

Ruairi O'Driscoll ◽

Jake Turicchi ◽

Mark Hopkins ◽

Cristiana Duarte ◽

Graham W Horgan ◽

...

Keyword(s):

Physical Activity ◽

Machine Learning ◽

Neural Networks ◽

Random Forest ◽

Energy Expenditure ◽

Superior Performance ◽

Gradient Boosting ◽

Support Vector ◽

Data Sets ◽

Out Of Sample

Background Accurate solutions for the estimation of physical activity and energy expenditure at scale are needed for a range of medical and health research fields. Machine learning techniques show promise in research-grade accelerometers, and some evidence indicates that these techniques can be applied to more scalable commercial devices. Objective This study aims to test the validity and out-of-sample generalizability of algorithms for the prediction of energy expenditure in several wearables (ie, Fitbit Charge 2, ActiGraph GT3-x, SenseWear Armband Mini, and Polar H7) using two laboratory data sets comprising different activities. Methods Two laboratory studies (study 1: n=59, age 44.4 years, weight 75.7 kg; study 2: n=30, age=31.9 years, weight=70.6 kg), in which adult participants performed a sequential lab-based activity protocol consisting of resting, household, ambulatory, and nonambulatory tasks, were combined in this study. In both studies, accelerometer and physiological data were collected from the wearables alongside energy expenditure using indirect calorimetry. Three regression algorithms were used to predict metabolic equivalents (METs; ie, random forest, gradient boosting, and neural networks), and five classification algorithms (ie, k-nearest neighbor, support vector machine, random forest, gradient boosting, and neural networks) were used for physical activity intensity classification as sedentary, light, or moderate to vigorous. Algorithms were evaluated using leave-one-subject-out cross-validations and out-of-sample validations. Results The root mean square error (RMSE) was lowest for gradient boosting applied to SenseWear and Polar H7 data (0.91 METs), and in the classification task, gradient boost applied to SenseWear and Polar H7 was the most accurate (85.5%). Fitbit models achieved an RMSE of 1.36 METs and 78.2% accuracy for classification. Errors tended to increase in out-of-sample validations with the SenseWear neural network achieving RMSE values of 1.22 METs in the regression tasks and the SenseWear gradient boost and random forest achieving an accuracy of 80% in classification tasks. Conclusions Algorithms trained on combined data sets demonstrated high predictive accuracy, with a tendency for superior performance of random forests and gradient boosting for most but not all wearable devices. Predictions were poorer in the between-study validations, which creates uncertainty regarding the generalizability of the tested algorithms.

Download Full-text