Data analysis with Shapley values for automatic subject selection in Alzheimer’s disease data sets using interpretable machine learning

2021 ◽  
Vol 13 (1) ◽  
Author(s):  
Louise Bloch ◽  
Christoph M. Friedrich ◽  

Abstract
Background: For the recruitment and monitoring of subjects for therapy studies, it is important to predict whether subjects with mild cognitive impairment (MCI) will go on to develop Alzheimer’s disease (AD). Machine learning (ML) is well suited to improving early AD prediction. The etiology of AD is heterogeneous, which leads to high variability in disease patterns. Further variability originates from multicentric study designs, varying acquisition protocols, and errors in the preprocessing of magnetic resonance imaging (MRI) scans. This high variability makes it difficult to separate signal from noise and may lead to overfitting. This article examines whether an automatic and fair data valuation method based on Shapley values can identify the most informative subjects to improve ML classification.
Methods: An ML workflow was developed and trained on a subset of the Alzheimer’s Disease Neuroimaging Initiative (ADNI) cohort. It was validated on an independent ADNI test set and on the Australian Imaging, Biomarker and Lifestyle Flagship Study of Ageing (AIBL) cohort. The workflow included volumetric MRI feature extraction, feature selection, sample selection using Data Shapley, random forest (RF) and eXtreme Gradient Boosting (XGBoost) for model training, and Kernel SHapley Additive exPlanations (SHAP) values for model interpretation.
Results: For the independent ADNI test set, the base models reached a mean accuracy of 62.64%. The RF models that excluded 134 of the 467 training subjects based on their RF Data Shapley values outperformed them by 5.76% (3.61 percentage points). The XGBoost base models reached a mean accuracy of 60.00% for the AIBL data set. Excluding the 133 subjects with the smallest RF Data Shapley values improved the classification accuracy by 2.98% (1.79 percentage points). The cutoff values were calculated using an independent validation set.
Conclusion: The Data Shapley method was able to improve the mean accuracies for the test sets. The most informative subjects were associated with the number of Apolipoprotein E ε4 (ApoE ε4) alleles, cognitive test results, and volumetric MRI measurements.
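The sample selection step described above can be illustrated with a minimal Monte Carlo sketch of Data Shapley. It is not the authors' implementation: the abstract reports RF-based Data Shapley values, whereas this sketch swaps in a toy nearest-centroid classifier on one-dimensional data so that it runs with the standard library alone. The function names, the toy data, the chance-level score of 0.5 for incomplete training sets, and the number of permutations are all assumptions made for illustration.

```python
import random

def centroid_accuracy(train, valid):
    """Validation accuracy of a nearest-centroid classifier (toy stand-in model)."""
    by_label = {0: [], 1: []}
    for x, y in train:
        by_label[y].append(x)
    if not by_label[0] or not by_label[1]:
        return 0.5  # assumption: chance level when a class is missing
    c0 = sum(by_label[0]) / len(by_label[0])
    c1 = sum(by_label[1]) / len(by_label[1])
    hits = sum((abs(x - c1) < abs(x - c0)) == bool(y) for x, y in valid)
    return hits / len(valid)

def monte_carlo_data_shapley(train, valid, n_permutations=200, seed=0):
    """Average each training point's marginal gain over random orderings."""
    rng = random.Random(seed)
    values = [0.0] * len(train)
    for _ in range(n_permutations):
        order = list(range(len(train)))
        rng.shuffle(order)
        subset = []
        previous = centroid_accuracy(subset, valid)
        for idx in order:
            subset.append(train[idx])
            current = centroid_accuracy(subset, valid)
            values[idx] += current - previous  # marginal contribution of idx
            previous = current
    return [v / n_permutations for v in values]

# Hypothetical toy data: (feature, label); the last point contradicts its neighbors.
train = [(0.0, 0), (0.2, 0), (0.4, 0), (1.6, 1), (1.8, 1), (2.0, 1), (1.9, 0)]
valid = [(0.1, 0), (0.3, 0), (1.7, 1), (1.9, 1)]

values = monte_carlo_data_shapley(train, valid)
# Rank subjects from least to most valuable; a cutoff would drop the head of this list.
ranking = sorted(range(len(train)), key=lambda i: values[i])
```

Because each ordering telescopes from the empty set to the full training set, the values sum to the accuracy gain of the full set over the empty-set baseline. In a workflow like the one in the abstract, the number of low-valued subjects to drop would be chosen on a separate validation set.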

2021 ◽  
Author(s):  
Louise Bloch ◽  
Christoph M. Friedrich

Abstract
Background: Predicting whether subjects with Mild Cognitive Impairment (MCI) will go on to develop Alzheimer's Disease (AD) is important for the recruitment and monitoring of subjects for therapy studies. Machine Learning (ML) is well suited to improving early AD prediction. The etiology of AD is heterogeneous, which leads to noisy data sets. Additional noise is introduced by multicentric study designs and varying acquisition protocols. This article examines whether an automatic and fair data valuation method based on Shapley values can identify subjects with noisy data.
Methods: An ML workflow was developed and trained on a subset of the Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort. It was validated on an independent ADNI test data set and on the Australian Imaging, Biomarker and Lifestyle Flagship Study of Ageing (AIBL) cohort. The workflow included volumetric Magnetic Resonance Imaging (MRI) feature extraction, subject sample selection using data Shapley, Random Forest (RF) and eXtreme Gradient Boosting (XGBoost) for model training, and Kernel SHapley Additive exPlanations (SHAP) values for model interpretation. This model interpretation enables clinically relevant explanations of individual predictions.
Results: On the independent ADNI test data set, the XGBoost models trained on the entire training data set reached a mean classification accuracy of 58.54%. The XGBoost models that excluded 116 of the 467 training subjects based on their Logistic Regression (LR) data Shapley values outperformed them by 14.13% (8.27 percentage points). For the AIBL data set, the XGBoost models trained on the entire training data set reached a mean accuracy of 60.35%. Excluding the 72 subjects with the smallest RF data Shapley values from the training data set improved the XGBoost models by 24.86% (15.00 percentage points).
Conclusion: The data Shapley method was able to improve the classification accuracies for the test data sets. Noisy data were associated with the number of ApoE ε4 alleles and volumetric MRI measurements. Kernel SHAP showed that the black-box models learned biologically plausible associations.
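The results above quote each improvement twice, once as a relative change and once in percentage points. The short check below shows how the two figures relate, using only the numbers reported in the abstract; the helper name is hypothetical.

```python
def relative_gain_percent(base_accuracy, gain_points):
    """Relative improvement (in %) for a gain expressed in percentage points."""
    return 100.0 * gain_points / base_accuracy

# (baseline mean accuracy in %, gain in percentage points), as reported above
reported = [
    (58.54, 8.27),   # XGBoost, independent ADNI test data set
    (60.35, 15.00),  # XGBoost, AIBL data set
]

gains = [round(relative_gain_percent(base, points), 2) for base, points in reported]
print(gains)  # → [14.13, 24.86]
```

A gain of 8.27 percentage points on a 58.54% baseline therefore corresponds to the reported 14.13% relative improvement, and likewise 15.00 points on 60.35% corresponds to 24.86%.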


2019 ◽  
Author(s):  
Daniel Stamate ◽  
Min Kim ◽  
Petroula Proitsi ◽  
Sarah Westwood ◽  
Alison Baird ◽  
...  

Abstract
INTRODUCTION: Machine learning (ML) may harbor the potential to capture the metabolic complexity in Alzheimer’s Disease (AD). Here we set out to test the performance of metabolites in blood to categorise AD when compared to CSF biomarkers.
METHODS: This study analysed samples from 242 cognitively normal (CN) people and 115 with AD-type dementia utilizing plasma metabolites (n=883). Deep Learning (DL), Extreme Gradient Boosting (XGBoost) and Random Forest (RF) were used to differentiate AD from CN. These models were internally validated using Nested Cross Validation (NCV).
RESULTS: On the test data, DL produced an AUC of 0.85 (0.80-0.89), XGBoost produced 0.88 (0.86-0.89) and RF produced 0.85 (0.83-0.87). By comparison, CSF measures of amyloid, p-tau and t-tau (together with age and gender) produced, with XGBoost, AUC values of 0.78, 0.83 and 0.87, respectively.
DISCUSSION: This study showed that plasma metabolites have the potential to match the AUC of well-established AD CSF biomarkers in a relatively small cohort. Further studies in independent cohorts are needed to validate whether this specific panel of blood metabolites can separate AD from controls, and how specific it is for AD as compared with other neurodegenerative disorders.
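Nested cross-validation, as mentioned in the methods, can be sketched as an index-splitting scheme: hyperparameters are tuned on inner folds drawn only from each outer training set, and performance is estimated on the outer test fold. The abstract does not state the number of folds or whether the splits were stratified or shuffled, so the fold counts and the unshuffled, unstratified split below are assumptions; only the sample count (242 CN plus 115 AD) comes from the text.

```python
def k_folds(indices, k):
    """Split indices into k disjoint, nearly equal folds (no shuffling)."""
    return [indices[i::k] for i in range(k)]

def nested_cv_splits(n_samples, outer_k=5, inner_k=3):
    """Yield (outer_train, outer_test, inner_splits) for each outer fold."""
    all_idx = list(range(n_samples))
    for outer_test in k_folds(all_idx, outer_k):
        held_out = set(outer_test)
        outer_train = [i for i in all_idx if i not in held_out]
        inner_splits = []
        # Inner folds see only the outer training indices, never the outer test fold.
        for inner_val in k_folds(outer_train, inner_k):
            val = set(inner_val)
            inner_train = [i for i in outer_train if i not in val]
            inner_splits.append((inner_train, inner_val))
        yield outer_train, outer_test, inner_splits

# 242 CN + 115 AD samples, as reported in the abstract
splits = list(nested_cv_splits(n_samples=242 + 115))
```

Keeping the outer test fold out of every inner split is what prevents the tuning step from leaking information into the reported performance estimate.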


2021 ◽  
Author(s):  
Bojan Bogdanovic ◽  
Tome Eftimov ◽  
Monika Simjanoska

Abstract
Background: Alzheimer's disease is still a field of research with many open questions. The complexity of the disease prevents early diagnosis before visible symptoms affecting the individual's cognitive capabilities occur. This research presents an in-depth analysis of a large data set encompassing medical, cognitive and lifestyle measurements from more than 12,000 individuals. Several hypotheses were established, and their validity was questioned in light of the obtained results.
Methods: The importance of appropriate experimental design is highly stressed in the research. Thus, a sequence of methods for handling missing data, redundancy, data imbalance, and correlation analysis was applied to preprocess the data set, and Random Forest and XGBoost models were then trained and evaluated with special attention to hyperparameter tuning. Both models were explained using the Shapley values produced by the SHAP method.
Results: XGBoost produced the best F1-score of 0.84 and as such is considered highly competitive among those published in the literature. This achievement, however, was not the main contribution of this paper. The goal of this research was to examine the global and local interpretability of both models and to derive conclusions about the established hypotheses. These methods led to a single scheme that presents the positive or negative influence of the values of each feature whose importance was confirmed by means of Shapley values. This scheme might be considered an additional source of knowledge for physicians and other experts concerned with the exact diagnosis of early-stage Alzheimer's disease.
Conclusion: The conclusions derived from the interpretability of the models rejected all the established hypotheses. This research clearly showed the importance of a machine learning explainability approach that opens the black box and unveils the relationships among the features and the diagnoses.
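The Shapley values that SHAP approximates can be computed exactly for a very small model by averaging each feature's marginal contribution over all feature orderings. The sketch below does this by brute force; it is not the SHAP library, and the toy model, the input, and the all-zero baseline are hypothetical choices made only to show how a positive or negative influence per feature arises.

```python
from itertools import permutations

def shapley_values(model, x, baseline):
    """Exact Shapley values: mean marginal contribution over all feature orders."""
    n = len(x)
    phi = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        current = list(baseline)
        previous = model(current)
        for j in order:
            current[j] = x[j]          # switch feature j from baseline to its value
            value = model(current)
            phi[j] += value - previous
            previous = value
    return [p / len(orders) for p in phi]

def toy_model(features):
    """Hypothetical model with one additive term and one interaction term."""
    a, b, c = features
    return 2.0 * a + b * c

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)  # → [2.0, 3.0, 3.0]
```

The attributions sum to the difference between the prediction for `x` and the prediction for the baseline, and the interaction term is split evenly between the two features that produce it. Exact enumeration grows factorially with the number of features, which is why approximations are used in practice.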


2021 ◽  
Vol 13 ◽  
Author(s):  
Yuan Sh ◽  
Benliang Liu ◽  
Jianhu Zhang ◽  
Ying Zhou ◽  
Zhiyuan Hu ◽  
...  

Background: There are no obvious clinical signs and symptoms in the early stages of Alzheimer’s disease (AD), and most patients have mild cognitive impairment (MCI) before diagnosis. Therefore, early diagnosis of AD is critical. This paper discusses the blood biomarkers of AD patients and uses machine learning methods to study the changes in the blood transcriptome during the development of AD and to search for potential blood biomarkers for AD.
Methods: Individualized blood mRNA expression data of 711 patients were downloaded from the GEO database, including the control group (CON) (238 patients), MCI (189 patients), and AD (284 patients). First, we analyzed the subcellular localization, protein types and enrichment pathways of the differentially expressed mRNAs in each group, and established an artificial intelligence individualized diagnostic model. Furthermore, the XCell tool was used to analyze the blood mRNA expression data and obtain blood cell composition and quantitative data. Ratio characteristics were established for mRNA and XCell data. Feature engineering operations such as collinearity and importance analysis were performed on all features to obtain the best feature collection. Finally, four machine learning algorithms, including linear support vector machine (SVM), Adaboost, random forest and artificial neural network, were used to model the optimal feature combinations and evaluate their classification performance in the test set.
Results: Through feature engineering screening, the best feature collection was obtained. Moreover, the artificial intelligence individualized diagnosis model established based on this method achieved a classification accuracy of 91.59% in the test set. The areas under the curve (AUC) for CON, MCI, and AD were 0.9746, 0.9536, and 0.9807, respectively.
Conclusion: The results of cell homeostasis analysis suggested that the homeostasis of Natural killer T cell (NKT) might be related to AD, and the homeostasis of Granulocyte macrophage progenitor (GMP) might be one of the reasons for AD.


2016 ◽  
Vol 13 (5) ◽  
pp. 498-508 ◽  
Author(s):  
V. Vigneron ◽  
A. Kodewitz ◽  
A. M. Tome ◽  
S. Lelandais ◽  
E. Lang

Processes ◽  
2020 ◽  
Vol 8 (9) ◽  
pp. 1071
Author(s):  
Lucia Billeci ◽  
Asia Badolato ◽  
Lorenzo Bachi ◽  
Alessandro Tonacci

Alzheimer’s disease is the most common cause of dementia in the elderly and affects an increasing number of people. Although widespread, its causes and modes of progression are complex and still not fully understood. Neuroimaging techniques, such as diffusion Magnetic Resonance (MR), allow more sophisticated and specific studies of the disease, offering a valuable tool for both its diagnosis and early detection. However, processing large quantities of medical images is not an easy task, and researchers have turned their attention towards machine learning, a set of computer algorithms that automatically adapt their output towards the intended goal. In this paper, a systematic review of recent machine learning applications in diffusion tensor imaging studies of Alzheimer’s disease is presented, highlighting the fundamental aspects of each work and reporting their performance scores. A few of the examined studies also include mild cognitive impairment in the classification problem, while others combine diffusion data with other sources, such as structural magnetic resonance imaging (MRI) (multimodal analysis). The findings of the retrieved works suggest a promising role for machine learning in evaluating effective classification features, such as fractional anisotropy, and possibly in achieving higher accuracy across different image modalities.


Author(s):  
M. Tanveer ◽  
B. Richhariya ◽  
R. U. Khan ◽  
A. H. Rashid ◽  
P. Khanna ◽  
...  
