Data Analysis With Shapley Values For Automatic Subject Selection in Alzheimer's Disease Data Sets Using Interpretable Machine Learning

Author(s):  
Louise Bloch ◽  
Christoph M. Friedrich

Abstract Background: The prediction of whether subjects with mild cognitive impairment (MCI) will prospectively develop Alzheimer's Disease (AD) is important for the recruitment and monitoring of subjects for therapy studies. Machine Learning (ML) is suitable to improve early AD prediction. The etiology of AD is heterogeneous, which leads to noisy data sets. Additional noise is introduced by multicentric study designs and varying acquisition protocols. This article examines whether an automatic and fair data valuation method based on Shapley values can identify subjects with noisy data. Methods: An ML workflow was developed and trained for a subset of the Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort. The validation was executed for an independent ADNI test data set and for the Australian Imaging, Biomarker and Lifestyle Flagship Study of Ageing (AIBL) cohort. The workflow included volumetric Magnetic Resonance Imaging (MRI) feature extraction, subject sample selection using Data Shapley, Random Forest (RF) and eXtreme Gradient Boosting (XGBoost) for model training, and Kernel SHapley Additive exPlanations (SHAP) values for model interpretation. This model interpretation enables clinically relevant explanations of individual predictions. Results: On the independent ADNI test data set, the XGBoost models that excluded 116 of the 467 subjects from the training data set based on their Logistic Regression (LR) Data Shapley values outperformed the models trained on the entire training data set, which reached a mean classification accuracy of 58.54%, by 14.13% (8.27 percentage points). The XGBoost models trained on the entire training data set reached a mean accuracy of 60.35% for the AIBL data set. An improvement of 24.86% (15.00 percentage points) could be reached for the XGBoost models if those 72 subjects with the smallest RF Data Shapley values were excluded from the training data set.
Conclusion: The data Shapley method was able to improve the classification accuracies for the test data sets. Noisy data was associated with the number of ApoEϵ4 alleles and volumetric MRI measurements. Kernel SHAP showed that the black-box models learned biologically plausible associations.
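The Data Shapley idea described above can be sketched in miniature: each training subject's value is its weighted average marginal contribution to validation accuracy over all subsets of the other subjects. The toy example below (hypothetical 1-D data and a 1-NN classifier, not the paper's MRI features or models) computes exact Data Shapley values and shows that a deliberately mislabeled point receives the lowest, negative value.

```python
from itertools import combinations
from math import factorial

# Toy 1-D training set; index 2 is deliberately mislabeled ("noisy data")
train_x = [0.0, 0.2, 1.08, 1.0, 1.2]
train_y = [0,   0,   0,    1,   1]
val = [(0.05, 0), (0.15, 0), (1.05, 1), (1.15, 1)]

def accuracy(subset):
    """Validation accuracy of a 1-NN classifier trained on `subset` of indices."""
    if not subset:
        return 0.5  # random-guess baseline for the empty coalition
    correct = 0
    for vx, vy in val:
        nearest = min(subset, key=lambda i: abs(train_x[i] - vx))
        correct += (train_y[nearest] == vy)
    return correct / len(val)

n = len(train_x)
values = []
for i in range(n):
    others = [j for j in range(n) if j != i]
    total = 0.0
    for size in range(n):
        # Shapley weight for a coalition of this size
        w = factorial(size) * factorial(n - size - 1) / factorial(n)
        for S in combinations(others, size):
            total += w * (accuracy(list(S) + [i]) - accuracy(list(S)))
    values.append(total)

print(values)  # the mislabeled subject (index 2) gets the lowest value
```

In the papers, this valuation is approximated (exact computation is exponential in the number of subjects) and the lowest-valued subjects are removed before retraining.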

Diagnostics ◽  
2021 ◽  
Vol 11 (5) ◽  
pp. 887
Author(s):  
Jorge I. Vélez ◽  
Luiggi A. Samper ◽  
Mauricio Arcos-Holzinger ◽  
Lady G. Espinosa ◽  
Mario A. Isaza-Ruget ◽  
...  

Machine learning (ML) algorithms are widely used to develop predictive frameworks. Accurate prediction of Alzheimer’s disease (AD) age of onset (ADAOO) is crucial to investigate potential treatments, follow-up, and therapeutic interventions. Although genetic and non-genetic factors affecting ADAOO were elucidated by other research groups and ours, the comprehensive and sequential application of ML to provide an exact estimation of the actual ADAOO, instead of a wide confidence interval within which the actual ADAOO may fall, remains to be explored. Here, we assessed the performance of ML algorithms for predicting ADAOO using two AD cohorts, one with early-onset familial AD and one with late-onset sporadic AD, combining genetic and demographic variables. Performance of the ML algorithms was assessed using the root mean squared error (RMSE), the R-squared (R2), and the mean absolute error (MAE) with a 10-fold cross-validation procedure. For predicting ADAOO in familial AD, boosting-based ML algorithms performed best. In the sporadic cohort, boosting-based ML algorithms performed best on the training data set, while regularization methods performed best on unseen data. ML algorithms represent a feasible alternative to accurately predict ADAOO with little human intervention. Future studies may include predicting the speed of cognitive decline in our cohorts using ML.
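The evaluation protocol above (10-fold cross-validation scored by RMSE, MAE, and R2) can be sketched as follows. The data are synthetic and a plain least-squares regressor stands in for the boosting and regularization models; all names and numbers are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for genetic + demographic predictors of an age of onset
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.0, 0.5]) + 50 + rng.normal(scale=1.0, size=200)

def kfold_metrics(X, y, k=10):
    """Mean RMSE, MAE, and R2 of an OLS model over k cross-validation folds."""
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    rmse, mae, r2 = [], [], []
    for f in folds:
        test = np.zeros(len(y), bool)
        test[f] = True
        # Fit least squares with an intercept column on the training folds
        Xtr = np.c_[np.ones((~test).sum()), X[~test]]
        beta, *_ = np.linalg.lstsq(Xtr, y[~test], rcond=None)
        pred = np.c_[np.ones(test.sum()), X[test]] @ beta
        err = y[test] - pred
        rmse.append(np.sqrt(np.mean(err ** 2)))
        mae.append(np.mean(np.abs(err)))
        r2.append(1 - np.sum(err ** 2) / np.sum((y[test] - y[test].mean()) ** 2))
    return map(np.mean, (rmse, mae, r2))

rmse, mae, r2 = kfold_metrics(X, y)
print(f"RMSE={rmse:.2f}  MAE={mae:.2f}  R2={r2:.2f}")
```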


2021 ◽  
Author(s):  
Bojan Bogdanovic ◽  
Tome Eftimov ◽  
Monika Simjanoska

Abstract Background: Alzheimer's disease is still a field of research with many open questions. The complexity of the disease prevents early diagnosis before visible symptoms regarding the individual's cognitive capabilities occur. This research presents an in-depth analysis of a large data set encompassing medical, cognitive, and lifestyle measurements from more than 12,000 individuals. Several hypotheses were established, and their validity was examined against the obtained results. Methods: The importance of appropriate experimental design is highly stressed in this research. Thus, a sequence of methods for handling missing data, redundancy, data imbalance, and correlation analysis was applied for appropriate preprocessing of the data set; subsequently, Random Forest and XGBoost models were trained and evaluated with special attention to hyperparameter tuning. Both models were explained using the Shapley values produced by the SHAP method. Results: XGBoost produced the best F1-score of 0.84 and as such is highly competitive among those published in the literature. This achievement, however, was not the main contribution of this paper. The goal of this research was to perform global and local interpretability of both intelligent models and to derive valuable conclusions about the established hypotheses. These methods led to a single scheme that presents the positive or negative influence of the values of each feature whose importance had been confirmed by means of Shapley values. This scheme might be considered an additional source of knowledge for physicians and other experts concerned with the exact diagnosis of early-stage Alzheimer's disease. Conclusion: The conclusions derived from the interpretability of the intelligent models rejected all the established hypotheses. This research clearly showed the importance of machine learning explainability approaches, which open the black box and clearly unveil the relationships among the features and the diagnoses.
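The SHAP attributions used above can be illustrated exactly for a tiny model: a feature's Shapley value is its weighted average marginal contribution over all feature coalitions, with absent features fixed to a background value. Kernel SHAP approximates this computation for larger feature sets; the model, feature names, and numbers below are hypothetical, chosen only to show the mechanics.

```python
from itertools import combinations
from math import factorial

# Hypothetical risk score over three standardized features
def f(x):
    apoe4, hippo_vol, mmse = x
    return 0.8 * apoe4 - 0.5 * hippo_vol - 0.3 * mmse + 0.2 * apoe4 * hippo_vol

background = [0.0, 0.0, 0.0]   # reference feature values (e.g., cohort means)
x_star = [2.0, -1.5, -1.0]     # the individual being explained

def v(S):
    """Model output with features outside coalition S fixed to the background."""
    z = [x_star[i] if i in S else background[i] for i in range(3)]
    return f(z)

n = 3
phi = []
for i in range(n):
    others = [j for j in range(n) if j != i]
    total = 0.0
    for size in range(n):
        w = factorial(size) * factorial(n - size - 1) / factorial(n)
        for S in combinations(others, size):
            total += w * (v(set(S) | {i}) - v(set(S)))
    phi.append(total)

print(phi)  # per-feature contributions; they sum to f(x_star) - f(background)
```

The efficiency property (contributions sum exactly to the prediction minus the background prediction) is what makes such a scheme readable as "positive or negative influence" per feature.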


2021 ◽  
pp. 002203452110357
Author(s):  
T. Chen ◽  
P.D. Marsh ◽  
N.N. Al-Hebshi

An intuitive, clinically relevant index of microbial dysbiosis as a summary statistic of subgingival microbiome profiles is needed. Here, we describe a subgingival microbial dysbiosis index (SMDI) based on machine learning analysis of published periodontitis/health 16S microbiome data. The raw sequencing data, split into training and test sets, were quality filtered, taxonomically assigned to the species level, and centered log-ratio transformed. The training data set was subjected to random forest analysis to identify discriminating species (DS) between periodontitis and health. DS lists, compiled by various “Gini” importance score cutoffs, were used to compute the SMDI for samples in the training and test data sets as the mean centered log-ratio abundance of periodontitis-associated species minus that of health-associated ones. Diagnostic accuracy was assessed with receiver operating characteristic analysis. An SMDI based on 49 DS provided the highest accuracy, with areas under the curve of 0.96 and 0.92 in the training and test data sets, respectively; it ranged from −6 (most normobiotic) to 5 (most dysbiotic), with a value around zero discriminating most of the periodontitis and healthy samples. The top periodontitis-associated DS were Treponema denticola, Mogibacterium timidum, Fretibacterium spp., and Tannerella forsythia, while Actinomyces naeslundii and Streptococcus sanguinis were the top health-associated DS. The index was highly reproducible across hypervariable regions. Applying the index to additional test data sets in which nitrate had been used to modulate the microbiome demonstrated that nitrate has dysbiosis-lowering properties in vitro and in vivo. Finally, three genera (Treponema, Fretibacterium, and Actinomyces) were identified that could be used for calculation of a simplified SMDI with comparable accuracy.
In conclusion, we have developed a nonbiased, reproducible, and easy-to-interpret index that can be used to identify patients/sites at risk of periodontitis, to assess the microbial response to treatment, and, importantly, as a quantitative tool in microbiome modulation studies.
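A minimal sketch of the index computation as described: centered log-ratio (CLR) transform the counts, then subtract the mean CLR abundance of health-associated species from that of periodontitis-associated species. The species table, counts, and pseudocount handling below are hypothetical; the real SMDI uses the 49 discriminating species from the random forest analysis.

```python
import numpy as np

# Hypothetical species-count table for one subgingival sample
species = ["T_denticola", "Fretibacterium_spp", "T_forsythia",
           "A_naeslundii", "S_sanguinis"]
counts = np.array([120.0, 80.0, 60.0, 15.0, 10.0])

perio_assoc = {"T_denticola", "Fretibacterium_spp", "T_forsythia"}
health_assoc = {"A_naeslundii", "S_sanguinis"}

def clr(x, pseudocount=0.5):
    """Centered log-ratio transform, with a pseudocount to guard against zeros."""
    x = x + pseudocount
    logs = np.log(x)
    return logs - logs.mean()

def smdi(counts):
    """Mean CLR abundance of periodontitis species minus that of health species."""
    z = clr(counts)
    p = np.mean([z[i] for i, s in enumerate(species) if s in perio_assoc])
    h = np.mean([z[i] for i, s in enumerate(species) if s in health_assoc])
    return p - h

print(round(smdi(counts), 3))  # positive: periodontitis species dominate
```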


2010 ◽  
Vol 22 (12) ◽  
pp. 2677-2684 ◽  
Author(s):  
Daniel S. Marcus ◽  
Anthony F. Fotenos ◽  
John G. Csernansky ◽  
John C. Morris ◽  
Randy L. Buckner

The Open Access Series of Imaging Studies is a series of neuroimaging data sets that are publicly available for study and analysis. The present MRI data set consists of a longitudinal collection of 150 subjects aged 60 to 96 years, all acquired on the same scanner using identical sequences. Each subject was scanned on two or more visits, separated by at least one year, for a total of 373 imaging sessions. Subjects were characterized using the Clinical Dementia Rating (CDR) as either nondemented or with very mild to mild Alzheimer's disease. Seventy-two of the subjects were characterized as nondemented throughout the study. Sixty-four of the included subjects were characterized as demented at the time of their initial visits and remained so for subsequent scans, including 51 individuals with CDR 0.5, a level of impairment similar to that of individuals elsewhere considered to have “mild cognitive impairment.” Another 14 subjects were characterized as nondemented at the time of their initial visit (CDR 0) and were subsequently characterized as demented at a later visit (CDR > 0). The subjects were all right-handed and include both men (n = 62) and women (n = 88). For each scanning session, three or four individual T1-weighted MRI scans were obtained. Multiple within-session acquisitions provide an extremely high contrast-to-noise ratio, making the data amenable to a wide range of analytic approaches, including automated computational analysis. Automated calculation of whole-brain volume is presented to demonstrate use of the data for measuring differences associated with normal aging and Alzheimer's disease.


1991 ◽  
Vol 9 (5) ◽  
pp. 871-876 ◽  
Author(s):  
M J Ratain ◽  
J Robert ◽  
W J van der Vijgh

Although doxorubicin is one of the most commonly used antineoplastics, no studies to date have clearly related the area under the concentration-time curve (AUC) to toxicity or response. The limited sampling model has recently been shown to be a feasible method for estimating the AUC to facilitate pharmacodynamic studies. Data from two previous studies of doxorubicin pharmacokinetics were used, including 26 patients with sarcoma and five patients with breast cancer or unknown primary. The former were divided into a training data set of 15 patients and a test data set of 11 patients, and the latter patients formed a second test data set. The model was developed by stepwise multiple regression on the training data set: AUC (ng·h/mL) = 17.39 C2 + 163 C48 − 111.0 [dose/(50 mg/m2)], where C2 and C48 are the concentrations at 2 and 48 hours after bolus dose. The model was subsequently validated on both test data sets: first test data set: mean predictive error (MPE) 4.7%, root mean square error (RMSE) 12.4%; second test data set: MPE 4.5%, RMSE 9.2%. An additional model was also generated using a simulated time point to estimate the total AUC for a daily × 3-day schedule: AUC (ng·h/mL) = 44.79 C2 + 175.65 C48 + 47.25 [dose/(25 mg/m2/d)], where C48 is obtained just prior to the third dose. We conclude that the AUC of doxorubicin after bolus administration can be adequately estimated from two timed plasma concentrations.
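The two-point limited-sampling model quoted above is straightforward to implement; here is a small sketch, with MPE and RMSE computed as percentage errors against observed AUCs. The patient values in the usage line are invented for illustration only.

```python
# Limited-sampling estimate of doxorubicin AUC after a bolus dose,
# from the published regression (dose normalized to 50 mg/m2).
def auc_bolus(c2, c48, dose_mg_m2):
    """AUC (ng·h/mL) from concentrations (ng/mL) at 2 h and 48 h post-bolus."""
    return 17.39 * c2 + 163.0 * c48 - 111.0 * (dose_mg_m2 / 50.0)

def mpe(pred, obs):
    """Mean predictive error, as a percentage of the observed values."""
    return 100.0 * sum((p - o) / o for p, o in zip(pred, obs)) / len(obs)

def rmse_pct(pred, obs):
    """Root mean square error, as a percentage of the observed values."""
    return 100.0 * (sum(((p - o) / o) ** 2 for p, o in zip(pred, obs)) / len(obs)) ** 0.5

# Hypothetical patient: C2 = 40 ng/mL, C48 = 8 ng/mL, dose 50 mg/m2
print(auc_bolus(40.0, 8.0, 50.0))
```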


2020 ◽  
Vol 10 (1) ◽  
pp. 55-60
Author(s):  
Owais Mujtaba Khanday ◽  
Samad Dadvandipour

Deep Neural Networks (DNN) have in the past few years revolutionized computer vision by providing the best results on a large number of problems such as image classification, pattern recognition, and speech recognition. One of the essential models in deep learning used for image classification is the convolutional neural network. These networks can integrate a different number of features, or so-called filters, in a multi-layer fashion called convolutional layers. These models use convolutional and pooling layers for feature abstraction and have neurons arranged in three dimensions: height, width, and depth. Filters of three different sizes were used: 3×3, 5×5, and 7×7. Training accuracy decreased from 100% to 97.8% as the filter size increased, and test accuracy likewise decreased: 98.7% for 3×3, 98.5% for 5×5, and 97.8% for 7×7. Over 10 epochs, the loss on the training and test data increased sharply, from 3.4% to 27.6% and from 12.5% to 23.02%, respectively. Thus, filters with smaller dimensions yield lower loss than those with larger dimensions. However, using the smaller filter size comes at the cost of computational complexity, which is very crucial in the case of larger data sets.
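The filter-size trade-off can be made concrete with a parameter count: a single convolutional layer has k·k·C_in·C_out weights plus C_out biases, so per layer the count grows quadratically with kernel size k (the extra cost the abstract attributes to small filters presumably comes from stacking more small-filter layers to match a large receptive field). The channel sizes below are arbitrary examples, not those of the paper's networks.

```python
def conv_params(k, c_in, c_out):
    """Weights plus biases of one 2-D convolutional layer with k x k kernels."""
    return k * k * c_in * c_out + c_out

# Compare the three filter sizes from the study for one example layer
for k in (3, 5, 7):
    print(f"{k}x{k}: {conv_params(k, c_in=32, c_out=64):,} parameters")
```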


2021 ◽  
Vol 13 (1) ◽  
Author(s):  
Louise Bloch ◽  
Christoph M. Friedrich

Abstract Background For the recruitment and monitoring of subjects for therapy studies, it is important to predict whether subjects with mild cognitive impairment (MCI) will prospectively develop Alzheimer’s disease (AD). Machine learning (ML) is suitable to improve early AD prediction. The etiology of AD is heterogeneous, which leads to high variability in disease patterns. Further variability originates from multicentric study designs, varying acquisition protocols, and errors in the preprocessing of magnetic resonance imaging (MRI) scans. The high variability makes the differentiation between signal and noise difficult and may lead to overfitting. This article examines whether an automatic and fair data valuation method based on Shapley values can identify the most informative subjects to improve ML classification. Methods An ML workflow was developed and trained for a subset of the Alzheimer’s Disease Neuroimaging Initiative (ADNI) cohort. The validation was executed for an independent ADNI test set and for the Australian Imaging, Biomarker and Lifestyle Flagship Study of Ageing (AIBL) cohort. The workflow included volumetric MRI feature extraction, feature selection, sample selection using Data Shapley, random forest (RF) and eXtreme Gradient Boosting (XGBoost) for model training, as well as Kernel SHapley Additive exPlanations (SHAP) values for model interpretation. Results The RF models, which excluded 134 of the 467 training subjects based on their RF Data Shapley values, outperformed the base models, which reached a mean accuracy of 62.64%, by 5.76% (3.61 percentage points) for the independent ADNI test set. The XGBoost base models reached a mean accuracy of 60.00% for the AIBL data set. The exclusion of those 133 subjects with the smallest RF Data Shapley values could improve the classification accuracy by 2.98% (1.79 percentage points). The cutoff values were calculated using an independent validation set.
Conclusion The Data Shapley method was able to improve the mean accuracies for the test sets. The most informative subjects were associated with the number of Apolipoprotein E ε4 (ApoE ε4) alleles, cognitive test results, and volumetric MRI measurements.
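The selection step described above — dropping the training subjects whose Data Shapley values fall below a cutoff tuned on a validation set — can be sketched as follows. The value array here is randomly generated, standing in for the precomputed per-subject Data Shapley values; in the study, the cutoff (and hence the number of excluded subjects) would be chosen by validation-set accuracy.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 467  # training-set size from the study
# Hypothetical per-subject Data Shapley values (output of the valuation step)
shapley = rng.normal(loc=0.001, scale=0.01, size=n)

def select_subjects(values, cutoff):
    """Indices of subjects whose data value is at or above the cutoff."""
    return np.flatnonzero(values >= cutoff)

# Sweep candidate cutoffs at different value quantiles; each kept subset
# would then be used to retrain the classifier and score a validation set.
for q in (0.0, 0.1, 0.25):
    cutoff = np.quantile(shapley, q)
    keep = select_subjects(shapley, cutoff)
    print(f"quantile {q}: keep {len(keep)} of {n} subjects")
```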


2018 ◽  
Vol 1 (1) ◽  
Author(s):  
Timon Cheng-Yi Liu ◽  
Quan-Guang Zhang ◽  
Chong-Yun Wu ◽  
Luo-Dan Yang ◽  
Ling Zhu ◽  
...  

Objective In MSSE, we divided male 2.5-month-old Sprague-Dawley rats into the following 4 groups: control (C), habitual swimming (SW), Alzheimer’s disease (AD) induction without swimming (AD), and habitual swimming followed by AD induction (SA), and found perfect resistance of habitual swimming to AD induction by using the P value statistics of the 5 behavior parameters of the rats and the 23 physiological and biochemical parameters of their hippocampus. The topological differences of the four groups were further calculated in this paper using quantitative difference (QD) and a self-similarity approach. Methods 1. The logarithm to base golden section τ (lt) is called the golden logarithm. It was found that σ = lt σ ≈ 0.710439287156503. 2. For a process from x1 to x2, lx(1,2) = lt(x2/x1) and its absolute value are called the process logarithm and its QD, QDx(1,2). There are QD threshold values (αx, βx, γx) of a function x which can be calculated in terms of σ. The function x is considered constant if QDx(1,2) < αx. A function in/far from its function-specific homeostasis is called a normal/dysfunctional function. A normal function can resist a disturbance under its threshold so that QDx(1,2) < βx. For a dysfunctional function, the QD is significant if βx ≦ QDx(1,2) < γx and extraordinarily significant if QDx(1,2) ≧ γx. 3. Self-similarity has been studied in the fractal literature: a pattern is self-similar if it does not vary with spatial or temporal scale. The first-order self-similarity condition leads to the power law between two data sets A = {xi} and B = {yi}: yi = ai xi if the QDi of ai relative to the average of {ai} is smaller than βmin = min{βi} and the average QD of {QDi} is smaller than αmin = min{αi}. 4. The σ algorithm for integrative biology was established based on high-order self-similarity. Those parameters that contribute to the topological difference were taken as the biomarkers. Results The 28-dimensional data set consisted of all 28 parameters.
The first-order self-similarity held true for the 28-dimensional data sets between groups C and SW. The topological algorithm for the other groups suggested three AD biomarkers: protein carbonyl, granule density of presynaptic synaptophysin in hippocampal CA1, and malondialdehyde intensity. The first two biomarkers were completely reversed by exercise pretreatment, while the third was partially reversed. Conclusions Exercise pretraining exerts partial benefits on AD that support its use as a promising new therapeutic option for the prevention of neurodegeneration in the elderly and/or AD population.
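The golden-logarithm fixed point quoted in the Methods (σ = lt σ) can be checked numerically; the sketch below defines lt as the logarithm to base τ (the golden section) and the process QD as |lt(x2/x1)|, then finds σ by bisection.

```python
from math import log, sqrt

tau = (sqrt(5) - 1) / 2  # golden section, ~0.618

def lt(x):
    """Golden logarithm: logarithm of x to base tau."""
    return log(x) / log(tau)

def qd(x1, x2):
    """Quantitative difference of a process from x1 to x2."""
    return abs(lt(x2 / x1))

# sigma solves sigma = lt(sigma); g(x) = x - lt(x) is increasing on (0, 1),
# so the root can be bracketed and found by bisection.
lo, hi = 0.5, 0.9
for _ in range(200):
    mid = (lo + hi) / 2
    if mid - lt(mid) < 0:
        lo = mid
    else:
        hi = mid
sigma = (lo + hi) / 2
print(sigma)  # matches the sigma value quoted in the Methods
```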


2021 ◽  
Author(s):  
Hye-Won Hwang ◽  
Jun-Ho Moon ◽  
Min-Gyu Kim ◽  
Richard E. Donatelli ◽  
Shin-Jae Lee

ABSTRACT Objectives To compare an automated cephalometric analysis based on the latest deep learning method for automatically identifying cephalometric landmarks (AI) with previously published AIs, according to the test style of the worldwide AI challenges at the International Symposium on Biomedical Imaging conferences held by the Institute of Electrical and Electronics Engineers (IEEE ISBI). Materials and Methods The latest AI was developed using a total of 1983 cephalograms as training data. In the training procedures, a modification of a contemporary deep learning method, the YOLO version 3 algorithm, was applied. Test data consisted of 200 cephalograms. To follow the same test style as the AI challenges at IEEE ISBI, a human examiner manually identified the 19 IEEE ISBI-designated cephalometric landmarks in both the training and test data sets, which were used as references for comparison. Then, the latest AI and another human examiner independently detected the same landmarks in the test data set. The test results were compared by the measures used at IEEE ISBI: the success detection rate (SDR) and the success classification rate (SCR). Results The SDR of the latest AI in the 2-mm range was 75.5% and the SCR was 81.5%. These were greater than those of any previous AI. Compared to the human examiners, the AI showed a superior success classification rate in some cephalometric analysis measures. Conclusions This latest AI seems to have superior performance compared to previous AI methods. It also seems to demonstrate cephalometric analysis comparable to that of human examiners.
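The SDR metric used in the comparison — the fraction of landmarks detected within a given radius (here 2 mm) of the reference position — can be sketched as below. The coordinates are synthetic stand-ins, not ISBI data.

```python
import numpy as np

# Hypothetical reference and predicted coordinates (mm) for 19 landmarks
rng = np.random.default_rng(1)
ref = rng.uniform(0, 200, size=(19, 2))
pred = ref + rng.normal(scale=1.0, size=(19, 2))  # simulated detector error

def sdr(pred, ref, radius_mm=2.0):
    """Success detection rate: fraction of landmarks within radius of reference."""
    dists = np.linalg.norm(pred - ref, axis=1)
    return float(np.mean(dists <= radius_mm))

print(sdr(pred, ref))
```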


Author(s):  
Mark Ellisman ◽  
Maryann Martone ◽  
Gabriel Soto ◽  
Eleizer Masliah ◽  
David Hessler ◽  
...  

Structurally oriented biologists examine cells, tissues, organelles, and macromolecules in order to gain insight into cellular and molecular physiology by relating structure to function. The understanding of these structures can be greatly enhanced by the use of techniques for the visualization and quantitative analysis of three-dimensional structure. Three projects from current research activities will be presented in order to illustrate both the present capabilities of computer-aided techniques as well as their limitations and future possibilities. The first project concerns the three-dimensional reconstruction of the neuritic plaques found in the brains of patients with Alzheimer's disease. We have developed a software package, “Synu”, for investigation of 3D data sets, which has been used in conjunction with laser confocal light microscopy to study the structure of the neuritic plaque. Tissue sections of autopsy samples from patients with Alzheimer's disease were double-labeled for tau, a cytoskeletal marker for abnormal neurites, and synaptophysin, a marker of presynaptic terminals.

