Validating the genomic signature of pediatric septic shock

We previously generated genome-wide expression data (microarray) from children with septic shock, data with the potential to lead the field into novel areas of investigation. Herein we seek to validate those data through a bioinformatic approach centered on a validation patient cohort. Forty-two children with a clinical diagnosis of septic shock and 15 normal controls served as the training data set, while 30 separate children with septic shock and 14 separate normal controls served as the test data set. Class prediction modeling using the training data set and the previously reported genome-wide expression signature of pediatric septic shock correctly identified 95–100% of controls and septic shock patients in the test data set, depending on the class prediction algorithm and the gene selection method. Subjecting the test data set to the same filtering strategy used for the training data set demonstrated 75% concordance between the two gene lists. Subjecting the test data set to a purely statistical filtering strategy, with highly stringent correction for multiple comparisons, demonstrated <50% concordance with the previous gene filtering strategy. However, functional analysis of this statistics-based gene list demonstrated functional annotations and signaling pathways similar to those seen in the training data set. In particular, we validated that pediatric septic shock is characterized by large-scale repression of genes related to zinc homeostasis and lymphocyte function. These data demonstrate that the previously reported genome-wide expression signature of pediatric septic shock is applicable to a validation cohort of patients.


Synthetic Sonic Log Generation With Machine Learning: A Contest Summary From Five Methods

Petrophysics – The SPWLA Journal of Formation Evaluation and Reservoir Description ◽

10.30632/pjv62n4-2021a4 ◽

2021 ◽

Vol 62 (4) ◽

pp. 393-406

Author(s):

Yanxiang Yu ◽

Chicheng Xu ◽

Siddharth Misra ◽

Weichang Li ◽

...

Keyword(s):

Machine Learning ◽

Test Data ◽

Short Term Memory ◽

Rock Physics ◽

Training Data ◽

Machine Learning Techniques ◽

Blind Test ◽

Data Set ◽

Benchmark Model ◽

Sonic Log

Compressional and shear sonic traveltime logs (DTC and DTS, respectively) are crucial for subsurface characterization and seismic-well ties. However, these two logs are often missing or incomplete in oil and gas wells. Therefore, many petrophysical and geophysical workflows include sonic log synthesis, or pseudo-log generation, based on multivariate regression or rock physics relations. From March 1 to May 7, 2020, the SPWLA PDDA SIG hosted a contest to predict the DTC and DTS logs from seven “easy-to-acquire” conventional logs using machine-learning methods (GitHub, 2020). In the contest, a total of 20,525 data points with half-foot resolution from three wells were collected to train regression models using machine-learning techniques. Each data point had seven features, the conventional “easy-to-acquire” logs (caliper, neutron porosity, gamma ray (GR), deep resistivity, medium resistivity, photoelectric factor, and bulk density), and two sonic logs (DTC and DTS) as the targets. A separate data set of 11,089 samples from a fourth well was then used as the blind test data set. Prediction performance was evaluated using the root mean square error (RMSE) as the metric:

$$\mathrm{RMSE} = \sqrt{\frac{1}{2}\cdot\frac{1}{m}\sum_{i=1}^{m}\left[\left(\mathrm{DTC}_{\mathrm{pred}}^{i}-\mathrm{DTC}_{\mathrm{true}}^{i}\right)^{2}+\left(\mathrm{DTS}_{\mathrm{pred}}^{i}-\mathrm{DTS}_{\mathrm{true}}^{i}\right)^{2}\right]}$$

where m is the number of data points. In the benchmark model (Yu et al., 2020), we used a Random Forest regressor with minimal preprocessing of the training data set; an RMSE score of 17.93 was achieved on the test data set. The top five models from the contest beat our benchmark model by 27% on average in the RMSE score. In this paper, we review these five solutions, including their preprocessing techniques and machine-learning models: neural networks, long short-term memory (LSTM), and ensemble trees. We found that data cleaning and clustering were critical for improving performance in all models.
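The contest metric described in this abstract pools the squared errors of both sonic logs before taking the root. A minimal Python sketch follows, assuming the predictions and measurements are plain sequences of equal length; the function and argument names are illustrative and do not come from the contest code.

```python
import math

def combined_rmse(dtc_pred, dtc_true, dts_pred, dts_true):
    """RMSE pooled over the DTC and DTS logs (hypothetical helper)."""
    m = len(dtc_true)
    if m == 0 or not (len(dtc_pred) == len(dts_pred) == len(dts_true) == m):
        raise ValueError("all four sequences must be non-empty and equal in length")
    # Sum the squared DTC and DTS errors sample by sample.
    squared_error = sum(
        (cp - ct) ** 2 + (sp - st) ** 2
        for cp, ct, sp, st in zip(dtc_pred, dtc_true, dts_pred, dts_true)
    )
    # Divide by 2m: m samples, two target logs per sample.
    return math.sqrt(squared_error / (2 * m))
```

Because both logs share one denominator, a large error in either DTC or DTS raises the score by the same amount; no per-log normalization is applied in this sketch.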


Data Analysis With Shapley Values For Automatic Subject Selection in Alzheimer's Disease Data Sets Using Interpretable Machine Learning

10.21203/rs.3.rs-245707/v1 ◽

2021 ◽

Author(s):

Louise Bloch ◽

Christoph M. Friedrich

Keyword(s):

Alzheimer's Disease ◽

Test Data ◽

Noisy Data ◽

Training Data ◽

Data Sets ◽

Data Set ◽

Model Interpretation ◽

Percentage Points ◽

Shapley Values

Abstract Background: The prediction of whether Mild Cognitive Impaired (MCI) subjects will prospectively develop Alzheimer's Disease (AD) is important for the recruitment and monitoring of subjects for therapy studies. Machine Learning (ML) is suitable to improve early AD prediction. The etiology of AD is heterogeneous, which leads to noisy data sets. Additional noise is introduced by multicentric study designs and varying acquisition protocols. This article examines whether an automatic and fair data valuation method based on Shapley values can identify subjects with noisy data. Methods: An ML workflow was developed and trained on a subset of the Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort. The validation was executed on an independent ADNI test data set and on the Australian Imaging, Biomarker and Lifestyle Flagship Study of Ageing (AIBL) cohort. The workflow included volumetric Magnetic Resonance Imaging (MRI) feature extraction, subject sample selection using data Shapley, Random Forest (RF) and eXtreme Gradient Boosting (XGBoost) for model training, and Kernel SHapley Additive exPlanations (SHAP) values for model interpretation. This model interpretation enables clinically relevant explanation of individual predictions. Results: XGBoost models trained after excluding 116 of the 467 subjects from the training data set, based on their Logistic Regression (LR) data Shapley values, outperformed the models trained on the entire training data set (mean classification accuracy of 58.54 %) by 14.13 % (8.27 percentage points) on the independent ADNI test data set. For the AIBL data set, the XGBoost models trained on the entire training data set reached a mean accuracy of 60.35 %. An improvement of 24.86 % (15.00 percentage points) was reached for the XGBoost models when the 72 subjects with the smallest RF data Shapley values were excluded from the training data set.
Conclusion: The data Shapley method was able to improve the classification accuracies for the test data sets. Noisy data was associated with the number of ApoEϵ4 alleles and volumetric MRI measurements. Kernel SHAP showed that the black-box models learned biologically plausible associations.
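The results quote each gain twice, once as a relative improvement and once in percentage points. The short Python sketch below shows how the two figures relate, assuming the quoted baseline accuracies are the reference values; the helper name is hypothetical.

```python
def relative_gain(baseline_pct, gain_points):
    """Return (new accuracy in %, relative improvement in %)."""
    new_accuracy = baseline_pct + gain_points
    # Relative improvement is the gain expressed as a share of the baseline.
    relative = 100.0 * gain_points / baseline_pct
    return new_accuracy, relative

# Baselines and percentage-point gains as quoted in the abstract.
adni_acc, adni_rel = relative_gain(58.54, 8.27)
aibl_acc, aibl_rel = relative_gain(60.35, 15.00)
```

Under this reading, the implied accuracies after subject exclusion are the baseline plus the percentage-point gain, and the relative figures agree with the quoted 14.13 % and 24.86 % up to rounding.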


Personal Adaptive Method to Assess Mental Tension during Daily Life Using Heart Rate Variability

Methods of Information in Medicine ◽

10.3414/me11-01-0027 ◽

2012 ◽

Vol 51 (01) ◽

pp. 39-44 ◽

Cited By ~ 11

Author(s):

K. Matsuoka ◽

K. Yoshino

Keyword(s):

Heart Rate ◽

Heart Rate Variability ◽

Linear Regression ◽

Multiple Linear Regression ◽

Test Data ◽

Daily Life ◽

Pearson Correlation ◽

Multiple Linear Regression Model ◽

Training Data ◽

Data Set

Summary Objectives: The aim of this study is to present a method of assessing psychological tension that is optimized for each individual on the basis of heart rate variability (HRV) data which, to eliminate the influence of inter-individual variability, are measured over a long period during daily life. Methods: HRV and body accelerations were recorded from nine normal subjects for two months of normal daily life. Fourteen HRV indices were calculated with the HRV data at 512 seconds prior to the time of every mental tension level report. Data to be analyzed were limited to those with body accelerations of 30 mG (0.294 m/s²) and lower. Further, the differences from the reference values in the same time zone were calculated for both the mental tension score (Δtension) and the HRV index values (ΔHRVI). A multiple linear regression model that estimates Δtension from the scores for principal components of ΔHRVI was then constructed for each individual. The data were divided into a training data set and a test data set in accordance with the twofold cross-validation method. Multiple linear regression coefficients were determined using the training data set, and the generalization capability of the optimized model was checked using the test data set. Results: The subjects’ mean Pearson correlation coefficient was 0.52 with the training data set and 0.40 with the test data set. The subjects’ mean coefficient of determination was 0.28 with the training data set and 0.11 with the test data set. Conclusion: We proposed a method of assessing psychological tension that is optimized for each individual based on HRV data measured over a long period of daily life.


Towards improving the accuracy of aortic transvalvular pressure gradients: rethinking Bernoulli

Medical & Biological Engineering & Computing ◽

10.1007/s11517-020-02186-w ◽

2020 ◽

Vol 58 (8) ◽

pp. 1667-1679

Author(s):

Benedikt Franke ◽

J. Weese ◽

I. Waechter-Stehle ◽

J. Brüning ◽

T. Kuehne ◽

...

Keyword(s):

Test Data ◽

Ground Truth ◽

Training Data ◽

Patient Specific ◽

Pressure Gradients ◽

Bernoulli Model ◽

Bernoulli Equation ◽

Data Set ◽

Non Invasive ◽

Adjusted Model

Abstract The transvalvular pressure gradient (TPG) is commonly estimated using the Bernoulli equation. However, the method is known to be inaccurate. Therefore, an adjusted Bernoulli model for accurate TPG assessment was developed and evaluated. Numerical simulations were used to calculate TPG_CFD in patient-specific geometries of aortic stenosis as ground truth. Geometries, aortic valve areas (AVA), and flow rates were derived from computed tomography scans. Simulations were divided into a training data set (135 cases) and a test data set (36 cases). The training data were used to fit an adjusted Bernoulli model as a function of AVA and flow rate. The model-predicted TPG_Model was evaluated using the test data set and also compared against the common Bernoulli equation (TPG_B). TPG_B and TPG_Model both correlated well with TPG_CFD (r > 0.94), but significantly overestimated it. The average difference between TPG_Model and TPG_CFD was much lower: 3.3 mmHg vs. 17.3 mmHg between TPG_B and TPG_CFD. Also, the standard error of estimate was lower for the adjusted model: SEE_Model = 5.3 mmHg vs. SEE_B = 22.3 mmHg. The adjusted model’s performance was more accurate than that of the conventional Bernoulli equation. The model might help to improve non-invasive assessment of TPG.
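For orientation, the conventional Bernoulli estimate can be sketched from the same two inputs the abstract names, AVA and flow rate. The sketch assumes the simplified clinical form (gradient in mmHg approximately 4 times the squared velocity in m/s) and takes the velocity as flow rate divided by valve area; the abstract does not state which form the authors used, and the adjusted model itself is not reproduced here.

```python
def bernoulli_tpg_mmhg(flow_rate_ml_s, ava_cm2):
    """Simplified Bernoulli gradient in mmHg (assumed form: 4 * v^2, v in m/s)."""
    if ava_cm2 <= 0:
        raise ValueError("valve area must be positive")
    # mL/s divided by cm^2 gives cm/s; convert to m/s.
    velocity_m_s = (flow_rate_ml_s / ava_cm2) / 100.0
    return 4.0 * velocity_m_s ** 2

# Hypothetical case: 300 mL/s through a 1.0 cm^2 valve gives v = 3 m/s.
example_tpg = bernoulli_tpg_mmhg(300.0, 1.0)
```

Since the gradient scales with the inverse square of the valve area in this form, small errors in AVA are amplified, which is one reason a fitted correction in AVA and flow rate is a natural adjustment.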


Diagnostic assessment of a deep learning system for detecting atrial fibrillation in pulse waveforms

Heart ◽

10.1136/heartjnl-2018-313147 ◽

2018 ◽

Vol 104 (23) ◽

pp. 1921-1928 ◽

Cited By ~ 36

Author(s):

Ming-Zher Poh ◽

Yukkee Cheung Poh ◽

Pak-Hei Chan ◽

Chun-Ka Wong ◽

Louise Pun ◽

...

Keyword(s):

Atrial Fibrillation ◽

Deep Learning ◽

Test Data ◽

Predictive Value ◽

Characteristic Curve ◽

Performance Comparison ◽

Learning System ◽

Training Data ◽

Validation Data ◽

Data Set

Objective: To evaluate the diagnostic performance of a deep learning system for automated detection of atrial fibrillation (AF) in photoplethysmographic (PPG) pulse waveforms. Methods: We trained a deep convolutional neural network (DCNN) to detect AF in 17 s PPG waveforms using a training data set of 149 048 PPG waveforms constructed from several publicly available PPG databases. The DCNN was validated using an independent test data set of 3039 smartphone-acquired PPG waveforms from adults at high risk of AF at a general outpatient clinic against ECG tracings reviewed by two cardiologists. Six established AF detectors based on handcrafted features were evaluated on the same test data set for performance comparison. Results: In the validation data set (3039 PPG waveforms) consisting of three sequential PPG waveforms from 1013 participants (mean (SD) age, 68.4 (12.2) years; 46.8% men), the prevalence of AF was 2.8%. The area under the receiver operating characteristic curve (AUC) of the DCNN for AF detection was 0.997 (95% CI 0.996 to 0.999) and was significantly higher than all the other AF detectors (AUC range: 0.924–0.985). The sensitivity of the DCNN was 95.2% (95% CI 88.3% to 98.7%), specificity was 99.0% (95% CI 98.6% to 99.3%), positive predictive value (PPV) was 72.7% (95% CI 65.1% to 79.3%) and negative predictive value (NPV) was 99.9% (95% CI 99.7% to 100%) using a single 17 s PPG waveform. Using the three sequential PPG waveforms in combination (<1 min in total), the sensitivity was 100.0% (95% CI 87.7% to 100%), specificity was 99.6% (95% CI 99.0% to 99.9%), PPV was 87.5% (95% CI 72.5% to 94.9%) and NPV was 100% (95% CI 99.4% to 100%). Conclusions: In this evaluation of PPG waveforms from adults screened for AF in a real-world primary care setting, the DCNN had high sensitivity, specificity, PPV and NPV for detecting AF, outperforming other state-of-the-art methods based on handcrafted features.
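The single-waveform PPV is much lower than the sensitivity and specificity because AF prevalence in the cohort is low. The Python sketch below applies Bayes' rule to the rounded figures quoted in the abstract; it is an illustration of that relationship, not the authors' calculation, which would have used the raw confusion-matrix counts.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) from test characteristics and disease prevalence."""
    true_pos = sensitivity * prevalence
    false_neg = (1.0 - sensitivity) * prevalence
    true_neg = specificity * (1.0 - prevalence)
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv

# Rounded single-waveform values from the abstract: 95.2%, 99.0%, 2.8%.
ppv, npv = predictive_values(0.952, 0.990, 0.028)
```

With these rounded inputs the sketch lands close to the reported 72.7% PPV and 99.9% NPV; the small gap is expected from rounding of the inputs.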


Completing the ENCODE3 compendium yields accurate imputations across a variety of assays and human biosamples

10.1101/533273 ◽

2019 ◽

Cited By ~ 7

Author(s):

Jacob Schreiber ◽

Jeffrey Bilmes ◽

William Stafford Noble

Keyword(s):

Biological Activity ◽

Protein Binding ◽

Histone Modification ◽

Chromatin Accessibility ◽

Training Data ◽

Data Sets ◽

Cellular Mechanisms ◽

Data Set ◽

Genome Wide

Abstract Motivation: Recent efforts to describe the human epigenome have yielded thousands of uniformly processed epigenomic and transcriptomic data sets. These data sets characterize a rich variety of biological activity in hundreds of human cell lines and tissues (“biosamples”). Understanding these data sets, and specifically how they differ across biosamples, can help explain many cellular mechanisms, particularly those driving development and disease. However, due primarily to cost, the total number of assays that can be performed is limited. Previously described imputation approaches, such as Avocado, have sought to overcome this limitation by predicting genome-wide epigenomics experiments using learned associations among available epigenomic data sets. However, these previous imputations have focused primarily on measurements of histone modification and chromatin accessibility, despite other biological activity being crucially important. Results: We applied Avocado to a data set of 3,814 tracks of data derived from the ENCODE compendium, spanning 400 human biosamples and 84 assays. The resulting imputations cover measurements of chromatin accessibility, histone modification, transcription, and protein binding. We demonstrate the quality of these imputations by comprehensively evaluating the model’s predictions and by showing significant improvements in protein binding performance compared to the top models in an ENCODE-DREAM challenge. Additionally, we show that the Avocado model allows for efficient addition of new assays and biosamples to a pre-trained model, achieving high accuracy at predicting protein binding, even with only a single track of training data. Availability: Tutorials and source code are available under an Apache 2.0 license at https://github.com/jmschrei/[email protected] or [email protected]


DEPTH ESTIMATION OF SHALLOW WATER USING MULTISPECTRAL SATELLITE IMAGERY SENTINEL-2A

Jurnal Segara ◽

10.15578/segara.v16i3.8562 ◽

2020 ◽

Vol 16 (3) ◽

Author(s):

Arip Rahman

Keyword(s):

Shallow Water ◽

Test Data ◽

Remote Sensing Data ◽

Depth Estimation ◽

Training Data ◽

Coefficient Of Determination ◽

Support Vector ◽

Data Set ◽

Svm Algorithm ◽

Sentinel 2A

Shallow-water bathymetry estimation from remote sensing data has become increasingly widespread as an alternative to traditional bathymetric measurement, which is hampered by technical and logistical problems. Bathymetry data were derived from Sentinel-2A images at visible wavelengths (blue, green, and red) with 10-meter spatial resolution around the waters of Kemujan Island, Karimunjawa National Park, Central Java. A total of 1280 sounding points were used as the training data set and 854 sounding points as the test data set. Dark Object Subtraction (DOS) was used to atmospherically correct the Sentinel-2A images. Several algorithms were applied to derive bathymetry data: linear transform, ratio transform, and support vector machine (SVM). The highest correlation between predicted and observed depth was obtained with the SVM algorithm, with a coefficient of determination (R²) of 0.71 (training data) and 0.56 (test data). In the accuracy assessment of the three methods using RMSE and MAE, the SVM algorithm had the smallest values (< 1 m). This indicates that the SVM algorithm is more accurate than the other two methods. The bathymetry map derived from Sentinel-2A imagery cannot be used as a reference for navigation.
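Of the three algorithms named, the ratio transform is the simplest to illustrate. The Python sketch below assumes the commonly used log-ratio form, in which depth is modeled as a linear function of ln(n·R_blue) / ln(n·R_green) and the two coefficients are fitted to sounding depths. The abstract does not state which variant or band pair was used, and the constant n = 1000 and all reflectance and depth values are hypothetical.

```python
import math

def log_ratio(blue, green, n=1000.0):
    """Log-ratio feature ln(n*blue) / ln(n*green) for one pixel."""
    return math.log(n * blue) / math.log(n * green)

def fit_line(x, y):
    """Ordinary least-squares fit y = slope * x + intercept."""
    k = len(x)
    mean_x = sum(x) / k
    mean_y = sum(y) / k
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical reflectances and sounding depths (m), for illustration only.
blue = [0.060, 0.055, 0.050, 0.045]
green = [0.050, 0.042, 0.035, 0.028]
depth = [2.0, 4.0, 6.0, 8.0]

ratios = [log_ratio(b, g) for b, g in zip(blue, green)]
m1, intercept = fit_line(ratios, depth)           # calibrate on training soundings
predicted = [m1 * r + intercept for r in ratios]  # apply to any pixel's ratio
```

In practice the coefficients would be calibrated on the training soundings and evaluated on the held-out test soundings, mirroring the 1280/854 split described above.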


Genome-wide expression profiling in pediatric septic shock

Pediatric Research ◽

10.1038/pr.2013.11 ◽

2013 ◽

Vol 73 (2-4) ◽

pp. 564-569 ◽

Cited By ~ 33

Author(s):

Hector R. Wong

Keyword(s):

Septic Shock ◽

Expression Profiling ◽

Genome Wide ◽

Genome Wide Expression


Pathogenic Variation in Colletotrichum gloeosporioides Infecting Stylosanthes spp. in a Center of Diversity in Brazil

Phytopathology ◽

10.1094/phyto.2002.92.5.553 ◽

2002 ◽

Vol 92 (5) ◽

pp. 553-562 ◽

Cited By ~ 12

Author(s):

S. Chakraborty ◽

C. D. Fernandes ◽

M. J. d' A. Charchar ◽

M. R. Thomas

Keyword(s):

Test Data ◽

Colletotrichum Gloeosporioides ◽

Germ Plasm ◽

Training Data ◽

Data Set ◽

Linear Discriminant ◽

Pathogenic Variation ◽

Wild Host ◽

Center Of Diversity ◽

New Races

Pathogenic variation in Colletotrichum gloeosporioides infecting species of the tropical pasture legume Stylosanthes at its center of diversity was determined from 296 isolates collected from wild host population and selected germ plasm of S. capitata, S. guianensis, S. scabra, and S. macrocephala in Brazil. A putative host differential set comprising 11 accessions was selected from a bioassay of 18 isolates on 19 host accessions using principal component analysis. A similar analysis of anthracnose severity data for a subset of 195 isolates on the 11 differentials indicated that an adequate summary of pathogenic variation could be obtained using only five of these differentials. Of the five differentials, S. seabrana ‘Primar’ was resistant and S. scabra ‘Fitzroy’ was susceptible to most isolates. A cluster analysis was used to determine eight natural race clusters using the 195 isolates. Linear discriminant functions were developed for eight race clusters using the 195 isolates as the training data set, and these were applied to classify a test data set of the remaining 101 isolates. All except 11 isolates of the test data set were classified into one of the eight race clusters. Over 10% of the 296 isolates were weakly pathogenic to all five differentials and another 40% were virulent on just one differential. The unclassified isolates represent six new races with unique virulence combinations, of which one isolate is virulent on all five differentials. The majority of isolates came from six field sites, and Shannon's index of diversity indicated considerable variation between sites. Pathogenic diversity was extensive at three sites where selected germ plasm were under evaluation, and complex race clusters and unclassified isolates representing new races were more prevalent at these sites compared with sites containing wild Stylosanthes populations.
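Shannon's index of diversity, used above to compare sites, can be sketched in a few lines of Python. The sketch assumes the index is computed over the race-cluster labels of the isolates at one site and uses the natural logarithm; the abstract specifies neither, and the labels in the example are hypothetical.

```python
import math
from collections import Counter

def shannon_index(labels):
    """Shannon diversity H = -sum(p_i * ln(p_i)) over category frequencies."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one label is required")
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Hypothetical site with isolates assigned to three race clusters.
site_labels = ["race1", "race1", "race2", "race3", "race3", "race3"]
site_diversity = shannon_index(site_labels)
```

The index is zero when every isolate at a site falls in one cluster and reaches ln(k) when k clusters are equally represented, so higher values indicate a more even mix of races.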


Development and Validation of a Score for Screening Suicide of Patients With Neuroendocrine Neoplasms

Frontiers in Psychiatry ◽

10.3389/fpsyt.2021.638152 ◽

2021 ◽

Vol 12 ◽

Author(s):

Lili Lu ◽

Yuru Shang ◽

Dietmar Zechner ◽

Christina Susanne Mullins ◽

Michael Linnebacher ◽

...

Keyword(s):

Life Expectancy ◽

Test Data ◽

Characteristic Curve ◽

Life Quality ◽

Training Data ◽

Neuroendocrine Neoplasm ◽

Neuroendocrine Neoplasms ◽

Validation Data ◽

Data Set ◽

Commit Suicide

Background: Whether a diagnosis of neuroendocrine neoplasm (NEN) increases a patient's risk of suicide has not been investigated so far. Identifying NEN patients at risk of suicide is important for improving their quality of life and life expectancy. Methods and findings: Cancer cases were extracted from the Surveillance, Epidemiology, and End Results program and were divided into NEN and non-NEN cohorts. Subsequently, the NEN patients were randomly split into a training data set and a validation data set. Analyzing the training data set, we developed a score for assessing the risk of suicide in patients with NEN. In addition, we validated the score using the validation data set and evaluated whether this score could also be applied to other cancer entities by using the test data set, a non-NEN cohort. The odds ratio (OR) of suicide between NEN and non-NEN patients was determined. Moreover, the performance of the score was evaluated by the receiver operating characteristic curve and the area under the curve (AUC). Compared to non-NEN, NEN significantly increased the risk of suicide 1.8-fold (NEN vs. non-NEN; OR, 1.832; P < 0.001). In addition, we observed that age, gender, race, marital status, tumor stage, histologic grade, surgery, and chemotherapy were associated with suicide among NEN patients, and a synthesized score based on these factors could significantly distinguish suicide individuals from non-suicide individuals in the training data set (AUC, 0.829; P < 0.001) and in the validation data set (AUC, 0.735; P < 0.001). This score also performed well when assessed on the test data set (AUC, 0.690; P < 0.001). This demonstrates that the score might also be applicable to other cancer entities. Conclusions: This population-based study suggests that NEN patients have a higher risk of suicide than non-NEN patients. In addition, this study provides a score that can identify NEN patients at high risk of suicide. Thus, this score, in combination with current screening and prevention strategies for suicide, may improve the quality of life and life expectancy of NEN patients.
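The two summary statistics reported here, the odds ratio and the AUC, can be illustrated with a short Python sketch. It assumes a 2×2 table of counts for the odds ratio and computes the AUC as the probability that a randomly chosen case scores higher than a randomly chosen non-case, with ties counted as one half. All numbers are hypothetical and are not the study's data.

```python
def odds_ratio(exposed_cases, exposed_noncases, unexposed_cases, unexposed_noncases):
    """Odds ratio from a 2x2 table of counts."""
    return (exposed_cases * unexposed_noncases) / (exposed_noncases * unexposed_cases)

def auc(case_scores, noncase_scores):
    """AUC as the fraction of case/non-case pairs ranked correctly (ties = 0.5)."""
    wins = 0.0
    for case in case_scores:
        for noncase in noncase_scores:
            if case > noncase:
                wins += 1.0
            elif case == noncase:
                wins += 0.5
    return wins / (len(case_scores) * len(noncase_scores))

# Hypothetical risk scores for four cases and four non-cases.
example_auc = auc([7, 5, 6, 3], [2, 4, 1, 3])
```

The pairwise definition is quadratic in the sample size, which is fine for a sketch; a registry-scale analysis would typically use a rank-based implementation from a statistics library instead.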
