Machine Learning Model Validation for Early Stage Studies with Small Sample Sizes

Author(s):  
Robyn Larracy ◽  
Angkoon Phinyomark ◽  
Erik Scheme
2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Florent Le Borgne ◽  
Arthur Chatton ◽  
Maxime Léger ◽  
Rémi Lenain ◽  
Yohann Foucher

Abstract: In clinical research, there is a growing interest in the use of propensity score-based methods to estimate causal effects. G-computation (GC) is an alternative because of its high statistical power. Machine learning is also increasingly used because of its possible robustness to model misspecification. In this paper, we aimed to propose an approach that combines machine learning and G-computation when both the outcome and the exposure status are binary, and that is able to deal with small samples. Through simulations, we evaluated the performance of several methods, including penalized logistic regressions, a neural network, a support vector machine, boosted classification and regression trees, and a super learner. We proposed six different scenarios characterised by various sample sizes, numbers of covariates, and relationships between covariates, exposure statuses, and outcomes. We also illustrated the application of these methods by using them to estimate the efficacy of barbiturates prescribed during the first 24 h of an episode of intracranial hypertension. In the context of GC, for estimating the individual outcome probabilities in two counterfactual worlds, we reported that the super learner tended to outperform the other approaches in terms of both bias and variance, especially for small sample sizes. The support vector machine performed well, but its mean bias was slightly higher than that of the super learner. In the investigated scenarios, G-computation associated with the super learner performed well for drawing causal inferences, even from small sample sizes.
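As a rough illustration of the G-computation idea described in this abstract (predicting each individual's outcome probability in the two counterfactual worlds, then averaging), here is a minimal Python sketch. It is not the authors' implementation: the outcome model is a simple stratum-mean estimator rather than a super learner, there is a single binary confounder, and the function names `make_rows` and `g_computation` and the toy data are assumptions made for this example.

```python
from collections import defaultdict


def make_rows(confounder, exposure, n, events):
    """Build n toy records (confounder, exposure, outcome) with `events` outcomes equal to 1."""
    return [(confounder, exposure, 1)] * events + [(confounder, exposure, 0)] * (n - events)


def g_computation(rows):
    """Estimate marginal risks under exposure and non-exposure for binary data.

    rows: list of (l, a, y) tuples with binary confounder l, exposure a, outcome y.
    Assumes every (a, l) combination is observed (positivity); otherwise a KeyError is raised.
    """
    # Step 1: fit the outcome model Q(a, l) = P(Y = 1 | A = a, L = l).
    # Here this is just the observed outcome proportion in each stratum.
    totals = defaultdict(lambda: [0, 0])  # (a, l) -> [sum of outcomes, count]
    for l, a, y in rows:
        totals[(a, l)][0] += y
        totals[(a, l)][1] += 1
    q = {key: s / n for key, (s, n) in totals.items()}

    # Step 2: predict each individual's outcome probability in both
    # counterfactual worlds (everyone exposed, everyone unexposed).
    n = len(rows)
    risk_exposed = sum(q[(1, l)] for l, _, _ in rows) / n
    risk_unexposed = sum(q[(0, l)] for l, _, _ in rows) / n

    # Step 3: contrast the two marginal risks.
    return risk_exposed, risk_unexposed, risk_exposed - risk_unexposed


# Hypothetical toy data: 20 subjects, confounded exposure assignment.
rows = (
    make_rows(0, 1, n=4, events=1)
    + make_rows(0, 0, n=6, events=3)
    + make_rows(1, 1, n=6, events=3)
    + make_rows(1, 0, n=4, events=3)
)
risk1, risk0, risk_difference = g_computation(rows)
print(risk1, risk0, risk_difference)
```

In the paper's setting, step 1 would be replaced by a flexible learner fitted on exposure and covariates; steps 2 and 3 stay the same.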


2021 ◽  
Vol 12 ◽  
Author(s):  
Jianwei Xiao ◽  
Rongsheng Wang ◽  
Xu Cai ◽  
Zhizhong Ye

Rheumatoid arthritis (RA) is an incurable disease that afflicts 0.5–1.0% of the global population, though it is less threatening at its early stage. Therefore, improved diagnostic efficiency and prognostic outcome are critical for confronting RA. Although machine learning is considered a promising technique in clinical research, its potential for verifying the biological significance of genes has not been fully exploited. The performance of a machine learning model depends greatly on the features used for model training; therefore, the effectiveness of prediction might reflect the quality of the input features. In the present study, we used weighted gene co-expression network analysis (WGCNA) in conjunction with differentially expressed gene (DEG) analysis to select the key genes that were highly associated with RA phenotypes based on multiple microarray datasets of RA blood samples, after which they were used as features in machine learning model validation. A total of six machine learning models were used to validate the biological significance of the key genes based on gene expression, among which five models achieved good performance [area under curve (AUC) >0.85], suggesting that the identified key genes are biologically significant and highly representative of genes involved in RA. Combined with other biological interpretations, including Gene Ontology (GO) analysis, protein–protein interaction (PPI) network analysis, and inference of immune cell composition, our study might shed light on the in-depth study of RA diagnosis and prognosis.
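For readers unfamiliar with the AUC criterion used above, the following Python sketch computes the area under the ROC curve directly from its pairwise definition: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name `auc` and the scores and labels are made up for illustration and are not taken from the study.

```python
def auc(scores, labels):
    """Area under the ROC curve from the pairwise (Mann-Whitney) definition.

    scores: predicted scores, higher means more likely positive.
    labels: true binary labels (1 = positive, 0 = negative).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")

    # Count positive/negative pairs that are ranked correctly; ties count as half.
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical classifier scores for five samples.
example_scores = [0.9, 0.8, 0.4, 0.3, 0.2]
example_labels = [1, 1, 0, 1, 0]
example_auc = auc(example_scores, example_labels)
print(round(example_auc, 3))
```

With small samples such as those discussed on this page, the number of positive/negative pairs is small, so a single misranked pair can shift the AUC noticeably.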


2020 ◽  
Vol 160 ◽  
pp. 113661 ◽  
Author(s):  
Md. Martuza Ahamad ◽  
Sakifa Aktar ◽  
Md. Rashed-Al-Mahfuz ◽  
Shahadat Uddin ◽  
Pietro Liò ◽  
...  

PLoS ONE ◽  
2020 ◽  
Vol 15 (7) ◽  
pp. e0236092 ◽  
Author(s):  
Bence Ferdinandy ◽  
Linda Gerencsér ◽  
Luca Corrieri ◽  
Paula Perez ◽  
Dóra Újváry ◽  
...  

2020 ◽  
Author(s):  
Gargi Datta ◽  
Nabeeh A Hasan ◽  
Michael Strong ◽  
Sonia M Leach

Background: The increasing incidence of drug resistance in tuberculosis and other infectious diseases is an escalating cause for concern, emphasizing the urgent need to devise robust computational and molecular methods to identify drug-resistant strains. Although machine learning-based approaches using whole-genome sequence data can facilitate the inference of drug resistance, current implementations do not optimally take advantage of information in public databases and are not robust for small sample sizes and mixed attribute types. Results: In this paper we introduce the Composite MetaDistance method, an approach for feature selection and classification of high-dimensional, unbalanced datasets with mixed attribute features from various data sources. We introduce a mixed-attribute, multi-view distance function to calculate distances between samples, with optimal handling of nominal features and different feature views. We also introduce a novel feature set for drug resistance prediction in Mycobacterium tuberculosis, using data from diverse sources. We compare the performance of Composite MetaDistance to multiple machine learning algorithms for Mycobacterium tuberculosis drug resistance prediction for three drugs. Composite MetaDistance consistently outperforms existing algorithms for small sample training sets, and performs as well as other algorithms for training sets with larger sample sizes. Conclusion: The feature set formulation introduced in this paper utilizes mutational and publicly available information for each gene, and is much richer than any previously devised. The prediction algorithm, Composite MetaDistance, is sample size agnostic and robust, especially for small sample sizes. Proper handling of nominal features improves performance even with a very small number of nominal features. We expect Composite MetaDistance to be even more robust for datasets with a higher percentage of nominal features. The algorithm is application independent and can be used for any mixed attribute dataset.
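The abstract does not give the form of the Composite MetaDistance function, so the sketch below uses a different, generic technique to illustrate what a distance over mixed attribute types can look like: a Gower-style distance, in which numeric features contribute a range-normalized absolute difference and nominal features contribute a 0/1 mismatch. The function `mixed_distance`, the feature names, and the values are all hypothetical.

```python
def mixed_distance(x, y, numeric_ranges):
    """Gower-style distance between two samples with mixed attribute types.

    x, y: dicts mapping feature name -> value (same keys in both).
    numeric_ranges: dict mapping each numeric feature name -> (min, max)
        observed in the dataset; any feature not listed is treated as nominal.
    Returns a value in [0, 1], assuming numeric values lie within their ranges.
    """
    total = 0.0
    for name in x:
        if name in numeric_ranges:
            low, high = numeric_ranges[name]
            # Range-normalized absolute difference; a constant feature adds nothing.
            total += abs(x[name] - y[name]) / (high - low) if high > low else 0.0
        else:
            # Nominal feature: simple mismatch indicator.
            total += 0.0 if x[name] == y[name] else 1.0
    return total / len(x)


# Hypothetical samples with one numeric and two nominal features.
sample_a = {"snp_count": 2, "gene_category": "efflux", "mutation_type": "missense"}
sample_b = {"snp_count": 7, "gene_category": "efflux", "mutation_type": "frameshift"}
ranges = {"snp_count": (0, 10)}

distance_ab = mixed_distance(sample_a, sample_b, ranges)
print(distance_ab)
```

A distance of this kind can be plugged into any distance-based classifier, such as nearest neighbours, without first one-hot encoding the nominal features.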

