Pathogenic Variation in Colletotrichum gloeosporioides Infecting Stylosanthes spp. in a Center of Diversity in Brazil

2002 ◽  
Vol 92 (5) ◽  
pp. 553-562 ◽  
Author(s):  
S. Chakraborty ◽  
C. D. Fernandes ◽  
M. J. d' A. Charchar ◽  
M. R. Thomas

Pathogenic variation in Colletotrichum gloeosporioides infecting species of the tropical pasture legume Stylosanthes at its center of diversity was determined from 296 isolates collected from wild host populations and selected germ plasm of S. capitata, S. guianensis, S. scabra, and S. macrocephala in Brazil. A putative host differential set comprising 11 accessions was selected from a bioassay of 18 isolates on 19 host accessions using principal component analysis. A similar analysis of anthracnose severity data for a subset of 195 isolates on the 11 differentials indicated that an adequate summary of pathogenic variation could be obtained using only five of these differentials. Of the five differentials, S. seabrana ‘Primar’ was resistant and S. scabra ‘Fitzroy’ was susceptible to most isolates. A cluster analysis of the 195 isolates identified eight natural race clusters. Linear discriminant functions were developed for the eight race clusters using the 195 isolates as the training data set, and these were applied to classify a test data set of the remaining 101 isolates. All except 11 isolates of the test data set were classified into one of the eight race clusters. Over 10% of the 296 isolates were weakly pathogenic to all five differentials, and another 40% were virulent on just one differential. The unclassified isolates represent six new races with unique virulence combinations, of which one isolate is virulent on all five differentials. The majority of isolates came from six field sites, and Shannon's index of diversity indicated considerable variation between sites. Pathogenic diversity was extensive at three sites where selected germ plasm was under evaluation, and complex race clusters and unclassified isolates representing new races were more prevalent at these sites than at sites containing wild Stylosanthes populations.
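
The race-classification workflow described above can be outlined in a few lines. The sketch below is a minimal illustration, not the authors' original analysis: it assumes anthracnose severity scores for the five key differentials are available as plain arrays (random placeholders stand in for real data) and uses Ward clustering plus scikit-learn's linear discriminant analysis.

```python
# Minimal sketch (not the authors' code): cluster training isolates into races,
# fit linear discriminant functions, and classify the remaining test isolates.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
severity_train = rng.random((195, 5))   # placeholder severity scores, 5 differentials
severity_test = rng.random((101, 5))    # placeholder scores for the 101 test isolates

# Step 1: derive eight race clusters from the training isolates.
tree = linkage(severity_train, method="ward")
race_cluster = fcluster(tree, t=8, criterion="maxclust")

# Step 2: fit linear discriminant functions on the training isolates ...
lda = LinearDiscriminantAnalysis().fit(severity_train, race_cluster)

# Step 3: ... and assign each test isolate to a race cluster; low posterior
# probabilities can flag isolates that may represent new races.
predicted_cluster = lda.predict(severity_test)
posterior = lda.predict_proba(severity_test)
```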

Author(s):  
Yanxiang Yu ◽  
Chicheng Xu ◽  
Siddharth Misra ◽  
Weichang Li ◽  
...  

Compressional and shear sonic traveltime logs (DTC and DTS, respectively) are crucial for subsurface characterization and seismic-well tie. However, these two logs are often missing or incomplete in many oil and gas wells. Therefore, many petrophysical and geophysical workflows include sonic log synthetization or pseudo-log generation based on multivariate regression or rock physics relations. From March 1, 2020 to May 7, 2020, the SPWLA PDDA SIG hosted a contest aiming to predict the DTC and DTS logs from seven “easy-to-acquire” conventional logs using machine-learning methods (GitHub, 2020). In the contest, a total of 20,525 data points with half-foot resolution from three wells were collected to train regression models using machine-learning techniques. Each data point had seven features, consisting of the conventional “easy-to-acquire” logs: caliper, neutron porosity, gamma ray (GR), deep resistivity, medium resistivity, photoelectric factor, and bulk density, as well as two sonic logs (DTC and DTS) as the target. A separate data set of 11,089 samples from a fourth well was then used as the blind test data set. The prediction performance of each model was evaluated using root mean square error (RMSE) as the metric: \( \mathrm{RMSE} = \sqrt{\frac{1}{2m}\sum_{i=1}^{m}\left[(\mathrm{DTC}^{i}_{\mathrm{pred}}-\mathrm{DTC}^{i}_{\mathrm{true}})^{2}+(\mathrm{DTS}^{i}_{\mathrm{pred}}-\mathrm{DTS}^{i}_{\mathrm{true}})^{2}\right]} \). In the benchmark model (Yu et al., 2020), we used a Random Forest regressor and applied minimal preprocessing to the training data set; an RMSE score of 17.93 was achieved on the test data set. The top five models from the contest, on average, beat our benchmark model by 27% in RMSE. In this paper, we review these five solutions, covering their preprocessing techniques and machine-learning models, including neural networks, long short-term memory (LSTM), and ensemble trees. We found that data cleaning and clustering were critical for improving performance in all models.
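
For readers who want to reproduce the benchmark setup, the sketch below shows a Random Forest regressor predicting DTC and DTS jointly from the seven conventional logs, scored with the contest's combined RMSE. The array shapes mirror the contest data, but their contents are random placeholders, and the exact preprocessing of the benchmark model is not reproduced.

```python
# Minimal sketch (placeholder arrays): multi-output Random Forest predicting DTC
# and DTS from the seven conventional logs, scored with the contest's combined RMSE.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def combined_rmse(y_true, y_pred):
    """RMSE pooled over both targets (DTC, DTS), as defined by the contest metric."""
    m = y_true.shape[0]
    return np.sqrt(np.sum((y_pred - y_true) ** 2) / (2 * m))

rng = np.random.default_rng(42)
X_train = rng.random((20525, 7))   # caliper, neutron, GR, resistivities, PEF, density
y_train = rng.random((20525, 2))   # DTC and DTS targets
X_test = rng.random((11089, 7))    # blind test well
y_test = rng.random((11089, 2))

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)        # multi-output regression covers both logs at once
print("combined RMSE:", combined_rmse(y_test, model.predict(X_test)))
```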


2021 ◽  
Author(s):  
Louise Bloch ◽  
Christoph M. Friedrich

Abstract Background: The prediction of whether subjects with Mild Cognitive Impairment (MCI) will prospectively develop Alzheimer's Disease (AD) is important for the recruitment and monitoring of subjects for therapy studies. Machine Learning (ML) is suitable to improve early AD prediction. The etiology of AD is heterogeneous, which leads to noisy data sets. Additional noise is introduced by multicentric study designs and varying acquisition protocols. This article examines whether an automatic and fair data valuation method based on Shapley values can identify subjects with noisy data. Methods: An ML workflow was developed and trained for a subset of the Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort. The validation was executed for an independent ADNI test data set and for the Australian Imaging, Biomarker and Lifestyle Flagship Study of Ageing (AIBL) cohort. The workflow included volumetric Magnetic Resonance Imaging (MRI) feature extraction, subject sample selection using data Shapley, Random Forest (RF) and eXtreme Gradient Boosting (XGBoost) for model training, and Kernel SHapley Additive exPlanations (SHAP) values for model interpretation. This model interpretation enables clinically relevant explanation of individual predictions. Results: On the independent ADNI test data set, the XGBoost models that excluded 116 of the 467 subjects from the training data set based on their Logistic Regression (LR) data Shapley values outperformed the models trained on the entire training data set, which reached a mean classification accuracy of 58.54 %, by 14.13 % (8.27 percentage points). The XGBoost models trained on the entire training data set reached a mean accuracy of 60.35 % for the AIBL data set. An improvement of 24.86 % (15.00 percentage points) could be reached for the XGBoost models if the 72 subjects with the smallest RF data Shapley values were excluded from the training data set. Conclusion: The data Shapley method was able to improve the classification accuracies for the test data sets. Noisy data were associated with the number of ApoE ϵ4 alleles and volumetric MRI measurements. Kernel SHAP showed that the black-box models learned biologically plausible associations.
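
The sample-valuation step can be illustrated as follows. This is a minimal sketch under stated assumptions: the data Shapley values are taken as precomputed (their calculation is not shown), the exclusion threshold of zero and the feature layout are hypothetical, and the XGBoost hyperparameters are illustrative rather than those used in the study.

```python
# Minimal sketch (placeholder data): drop subjects with low (precomputed) data
# Shapley values before training the XGBoost classifier; the valuation itself is
# not shown, and the zero threshold is an illustrative assumption.
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X_train = rng.random((467, 20))         # volumetric MRI features (placeholder)
y_train = rng.integers(0, 2, 467)       # 1 = MCI-to-AD converter, 0 = stable MCI
data_shapley = rng.normal(size=467)     # stand-in for LR-based data Shapley values

keep = data_shapley > 0.0               # exclude subjects judged "noisy" by the valuation
model = XGBClassifier(n_estimators=300, max_depth=3, eval_metric="logloss")
model.fit(X_train[keep], y_train[keep])
```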


2012 ◽  
Vol 51 (01) ◽  
pp. 39-44 ◽  
Author(s):  
K. Matsuoka ◽  
K. Yoshino

Summary Objectives: The aim of this study is to present a method of assessing psychological tension that is optimized for each individual on the basis of heart rate variability (HRV) data measured over a long period of daily life in order to eliminate the influence of inter-individual variability. Methods: HRV and body accelerations were recorded from nine normal subjects over two months of normal daily life. Fourteen HRV indices were calculated from the HRV data in the 512 seconds prior to the time of every mental tension level report. The data analyzed were limited to those with body accelerations of 30 mG (0.294 m/s2) and lower. Further, the differences from the reference values in the same time zone were calculated for both the mental tension score (Δtension) and the HRV index values (ΔHRVI). A multiple linear regression model that estimates Δtension from the scores of the principal components of ΔHRVI was then constructed for each individual. The data were divided into a training data set and a test data set in accordance with the twofold cross-validation method. The multiple linear regression coefficients were determined using the training data set, and the generalization capability of the optimized model was checked using the test data set. Results: The subjects’ mean Pearson correlation coefficient was 0.52 with the training data set and 0.40 with the test data set. The subjects’ mean coefficient of determination was 0.28 with the training data set and 0.11 with the test data set. Conclusion: We proposed a method of assessing psychological tension that is optimized for each individual based on HRV data measured over a long period of daily life.
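
A compact sketch of the per-subject model is given below: principal-component scores of the ΔHRVI vectors are regressed onto Δtension, with a twofold split used to check generalization. Data shapes, the number of retained components, and the random placeholder data are assumptions for illustration only.

```python
# Minimal sketch (illustrative data): regress Δtension on principal-component
# scores of ΔHRVI, with a twofold split to check generalization for one subject.
import numpy as np
from scipy.stats import pearsonr
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
delta_hrvi = rng.normal(size=(200, 14))   # 14 HRV indices per tension report
delta_tension = rng.normal(size=200)

half = len(delta_hrvi) // 2
train, test = slice(0, half), slice(half, None)

# Fit PCA and the regression model on the training half only.
pca = PCA(n_components=5).fit(delta_hrvi[train])
model = LinearRegression().fit(pca.transform(delta_hrvi[train]), delta_tension[train])

# Evaluate generalization on the held-out half via the Pearson correlation.
pred = model.predict(pca.transform(delta_hrvi[test]))
r, _ = pearsonr(delta_tension[test], pred)
print("test Pearson r:", round(float(r), 2))
```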


2020 ◽  
Vol 58 (8) ◽  
pp. 1667-1679
Author(s):  
Benedikt Franke ◽  
J. Weese ◽  
I. Waechter-Stehle ◽  
J. Brüning ◽  
T. Kuehne ◽  
...  

Abstract The transvalvular pressure gradient (TPG) is commonly estimated using the Bernoulli equation. However, the method is known to be inaccurate. Therefore, an adjusted Bernoulli model for accurate TPG assessment was developed and evaluated. Numerical simulations were used to calculate TPGCFD in patient-specific geometries of aortic stenosis as ground truth. Geometries, aortic valve areas (AVA), and flow rates were derived from computed tomography scans. Simulations were divided into a training data set (135 cases) and a test data set (36 cases). The training data were used to fit an adjusted Bernoulli model as a function of AVA and flow rate. The model-predicted TPGModel was evaluated using the test data set and also compared against the common Bernoulli equation (TPGB). TPGB and TPGModel both correlated well with TPGCFD (r > 0.94), but significantly overestimated it. The average difference between TPGModel and TPGCFD was much lower: 3.3 mmHg vs. 17.3 mmHg between TPGB and TPGCFD. Also, the standard error of estimate was lower for the adjusted model: SEEModel = 5.3 mmHg vs. SEEB = 22.3 mmHg. The adjusted model’s performance was more accurate than that of the conventional Bernoulli equation. The model might help to improve non-invasive assessment of TPG.
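
The adjusted-model fitting step might look like the sketch below, which fits a Bernoulli-like term a·(Q/AVA)² plus an offset to CFD-derived gradients by least squares. The functional form, units, and coefficients are illustrative assumptions and not necessarily the authors' exact model.

```python
# Minimal sketch (synthetic cases): fit a Bernoulli-like model TPG ≈ a*(Q/AVA)**2 + b
# to CFD-derived gradients by least squares; form and units are assumptions.
import numpy as np
from scipy.optimize import curve_fit

def adjusted_bernoulli(X, a, b):
    Q, AVA = X
    return a * (Q / AVA) ** 2 + b

rng = np.random.default_rng(2)
Q = rng.uniform(100.0, 500.0, 135)        # flow rate (placeholder units: mL/s)
AVA = rng.uniform(0.6, 2.0, 135)          # aortic valve area (cm^2)
tpg_cfd = 0.02 * (Q / AVA) ** 2 + rng.normal(0.0, 2.0, 135)   # stand-in for CFD gradients

params, _ = curve_fit(adjusted_bernoulli, (Q, AVA), tpg_cfd)
tpg_model = adjusted_bernoulli((Q, AVA), *params)   # model-predicted gradients
```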


Heart ◽  
2018 ◽  
Vol 104 (23) ◽  
pp. 1921-1928 ◽  
Author(s):  
Ming-Zher Poh ◽  
Yukkee Cheung Poh ◽  
Pak-Hei Chan ◽  
Chun-Ka Wong ◽  
Louise Pun ◽  
...  

Objective: To evaluate the diagnostic performance of a deep learning system for automated detection of atrial fibrillation (AF) in photoplethysmographic (PPG) pulse waveforms. Methods: We trained a deep convolutional neural network (DCNN) to detect AF in 17 s PPG waveforms using a training data set of 149 048 PPG waveforms constructed from several publicly available PPG databases. The DCNN was validated using an independent test data set of 3039 smartphone-acquired PPG waveforms from adults at high risk of AF at a general outpatient clinic against ECG tracings reviewed by two cardiologists. Six established AF detectors based on handcrafted features were evaluated on the same test data set for performance comparison. Results: In the validation data set (3039 PPG waveforms) consisting of three sequential PPG waveforms from 1013 participants (mean (SD) age, 68.4 (12.2) years; 46.8% men), the prevalence of AF was 2.8%. The area under the receiver operating characteristic curve (AUC) of the DCNN for AF detection was 0.997 (95% CI 0.996 to 0.999) and was significantly higher than that of all the other AF detectors (AUC range: 0.924–0.985). The sensitivity of the DCNN was 95.2% (95% CI 88.3% to 98.7%), specificity was 99.0% (95% CI 98.6% to 99.3%), positive predictive value (PPV) was 72.7% (95% CI 65.1% to 79.3%) and negative predictive value (NPV) was 99.9% (95% CI 99.7% to 100%) using a single 17 s PPG waveform. Using the three sequential PPG waveforms in combination (<1 min in total), the sensitivity was 100.0% (95% CI 87.7% to 100%), specificity was 99.6% (95% CI 99.0% to 99.9%), PPV was 87.5% (95% CI 72.5% to 94.9%) and NPV was 100% (95% CI 99.4% to 100%). Conclusions: In this evaluation of PPG waveforms from adults screened for AF in a real-world primary care setting, the DCNN had high sensitivity, specificity, PPV and NPV for detecting AF, outperforming other state-of-the-art methods based on handcrafted features.
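
The reported operating characteristics can be derived from binary predictions as in the sketch below. The prediction array is a random stand-in (the DCNN itself is not shown); only the metric computation against the ECG-adjudicated labels is illustrated.

```python
# Minimal sketch (synthetic labels): derive sensitivity, specificity, PPV, and NPV
# from binary AF predictions against the ECG-adjudicated reference labels.
import numpy as np
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(3)
y_true = rng.integers(0, 2, 3039)          # 1 = AF per cardiologist review
y_pred = y_true.copy()
flip = rng.random(3039) < 0.02             # inject a few errors for illustration
y_pred[flip] = 1 - y_pred[flip]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("sensitivity:", tp / (tp + fn))
print("specificity:", tn / (tn + fp))
print("PPV:", tp / (tp + fp))
print("NPV:", tn / (tn + fn))
```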


Signals ◽  
2020 ◽  
Vol 1 (2) ◽  
pp. 188-208
Author(s):  
Mert Sevil ◽  
Mudassir Rashid ◽  
Mohammad Reza Askari ◽  
Zacharie Maloney ◽  
Iman Hajizadeh ◽  
...  

Wearable devices continuously measure multiple physiological variables to inform users of health and behavior indicators. The computed health indicators must rely on informative signals obtained by processing the raw physiological variables with powerful noise- and artifact-filtering algorithms. In this study, we aimed to elucidate the effects of signal processing techniques on the accuracy of detecting and discriminating physical activity (PA) and acute psychological stress (APS) using physiological measurements (blood volume pulse, heart rate, skin temperature, galvanic skin response, and accelerometer) collected from a wristband. Data from 207 experiments involving 24 subjects were used to develop signal processing, feature extraction, and machine learning (ML) algorithms that can detect and discriminate PA and APS when they occur individually or concurrently, classify different types of PA and APS, and estimate energy expenditure (EE). Training data were used to generate feature variables from the physiological variables and to develop ML models (naïve Bayes, decision tree, k-nearest neighbor, linear discriminant, ensemble learning, and support vector machine). Results from an independent labeled testing data set demonstrate that PA was detected and classified with an accuracy of 99.3%, APS was detected and classified with an accuracy of 92.7%, simultaneous occurrences of PA and APS were detected and classified with an accuracy of 89.9% (relative to actual class labels), and EE was estimated with a low mean absolute error of 0.02 metabolic equivalent of task (MET). The data filtering and adaptive noise cancellation techniques used to mitigate the effects of noise and artifacts on the classification results increased the detection and discrimination accuracy by 0.7% and 3.0% for PA and APS, respectively, and by 18% for EE estimation. The results demonstrate that the physiological measurements from wristband devices are susceptible to noise and artifacts, and elucidate the effects of signal processing and feature extraction on the accuracy of detection, classification, and estimation of PA and APS.
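
The classification stage can be sketched as a comparison of the listed classifier families on a held-out labeled test set, as below. Feature extraction and the noise/artifact filtering steps are omitted, and the arrays are random placeholders, so the accuracies printed here are not meaningful; the sketch only shows the model-comparison pattern.

```python
# Minimal sketch (placeholder features): compare the classifier families named
# above on a held-out labeled test set; filtering and feature extraction omitted.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X_train, y_train = rng.random((5000, 30)), rng.integers(0, 4, 5000)  # PA/APS classes
X_test, y_test = rng.random((1000, 30)), rng.integers(0, 4, 1000)

classifiers = {
    "naive Bayes": GaussianNB(),
    "decision tree": DecisionTreeClassifier(),
    "k-nearest neighbor": KNeighborsClassifier(),
    "linear discriminant": LinearDiscriminantAnalysis(),
    "ensemble (random forest)": RandomForestClassifier(),
    "support vector machine": SVC(),
}
for name, clf in classifiers.items():
    acc = clf.fit(X_train, y_train).score(X_test, y_test)
    print(f"{name}: {acc:.3f}")
```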


Jurnal Segara ◽  
2020 ◽  
Vol 16 (3) ◽  
Author(s):  
Arip Rahman

Shallow water bathymetry estimation from remote sensing data has become increasingly widespread as an alternative to traditional bathymetry measurement, which is hampered by technical and logistic problems. Bathymetry was derived from Sentinel-2A images at visible wavelengths (blue, green, and red) with 10 m spatial resolution around the waters of Kemujan Island, Karimunjawa National Park, Central Java. A total of 1280 points from sounding data were used as the training data set and 854 points as the test data set. Dark Object Subtraction (DOS) was used to atmospherically correct the Sentinel-2A images. Several algorithms were applied to derive bathymetry, including the linear transform, the ratio transform, and a support vector machine (SVM). The highest correlation between predicted and observed depth resulted from the SVM algorithm, with a coefficient of determination (R2) of 0.71 (training data) and 0.56 (test data). Assessing the accuracy of the three methods using RMSE and MAE values, the SVM algorithm had the smallest values (< 1 m), indicating that the SVM algorithm has higher accuracy than the other two methods. The bathymetry map derived from Sentinel-2A imagery cannot be used as a reference for navigation.
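
A minimal sketch of the SVM-based depth retrieval is given below: reflectance in the blue, green, and red bands is regressed against sounding depths with support vector regression and then assessed with RMSE and MAE on the test points. Band values, depths, and hyperparameters are illustrative assumptions.

```python
# Minimal sketch (placeholder points): support vector regression of depth on the
# blue, green, and red band reflectances, assessed with RMSE and MAE on test points.
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, mean_absolute_error

rng = np.random.default_rng(5)
bands_train = rng.random((1280, 3))        # reflectance at the 1280 training points
depth_train = rng.uniform(0.0, 15.0, 1280) # sounding depths (m)
bands_test = rng.random((854, 3))
depth_test = rng.uniform(0.0, 15.0, 854)

svm = SVR(kernel="rbf", C=10.0).fit(bands_train, depth_train)
pred = svm.predict(bands_test)
print("RMSE:", np.sqrt(mean_squared_error(depth_test, pred)))
print("MAE:", mean_absolute_error(depth_test, pred))
```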


2021 ◽  
Vol 79 (1) ◽  
Author(s):  
Romana Haneef ◽  
Sofiane Kab ◽  
Rok Hrzic ◽  
Sonsoles Fuentes ◽  
Sandrine Fosse-Edorh ◽  
...  

Abstract Background: The use of machine learning techniques is increasing in healthcare, as they allow health outcomes to be estimated and predicted more efficiently from large administrative data sets. The main objective of this study was to develop a generic machine learning (ML) algorithm to estimate the incidence of diabetes based on the number of reimbursements over the last 2 years. Methods: We selected a final data set from a population-based epidemiological cohort (i.e., CONSTANCES) linked with the French National Health Database (i.e., SNDS). To develop this algorithm, we adopted a supervised ML approach. The following steps were performed: i. selection of the final data set, ii. target definition, iii. coding of variables for a given window of time, iv. splitting of the final data into training and test data sets, v. variable selection, vi. model training, vii. validation of the model with the test data set, and viii. selection of the model. We used the area under the receiver operating characteristic curve (AUC) to select the best algorithm. Results: The final data set used to develop the algorithm included 44,659 participants from CONSTANCES. Out of the 3468 variables from the SNDS linked to the CONSTANCES cohort that were coded, 23 variables were selected to train the different algorithms. The final algorithm to estimate the incidence of diabetes was a Linear Discriminant Analysis model based on the number of reimbursements of selected variables related to biological tests, drugs, medical acts, and hospitalization without a procedure over the last 2 years. This algorithm has a sensitivity of 62%, a specificity of 67%, and an accuracy of 67% [95% CI: 0.66–0.68]. Conclusions: Supervised ML is an innovative tool for the development of new methods to exploit large health administrative databases. In the context of the InfAct project, we have developed and applied for the first time a generic ML algorithm to estimate the incidence of diabetes for public health surveillance. The ML algorithm we have developed has moderate performance. The next step is to apply this algorithm to the SNDS to estimate the incidence of type 2 diabetes cases. More research is needed to apply various ML techniques to estimate the incidence of various health conditions.
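
The final modeling step can be sketched as below: a Linear Discriminant Analysis classifier trained on reimbursement counts, with the AUC on a held-out test set used for model selection. The feature matrix, labels, and the train/test split are illustrative assumptions; the actual SNDS variables are not reproduced.

```python
# Minimal sketch (placeholder counts): Linear Discriminant Analysis on reimbursement
# counts, with AUC on a held-out test set used for model selection.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(6)
X = rng.poisson(2, size=(44659, 23))       # counts for 23 selected variables (placeholder)
y = rng.integers(0, 2, 44659)              # 1 = incident diabetes, 0 = no diabetes

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
print("test AUC:", roc_auc_score(y_test, lda.predict_proba(X_test)[:, 1]))
```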


2021 ◽  
Vol 10 (2) ◽  
pp. 233-245
Author(s):  
Tanja Dorst ◽  
Yannick Robin ◽  
Sascha Eichstädt ◽  
Andreas Schütze ◽  
Tizian Schneider

Abstract. Process sensor data allow for not only the control of industrial processes but also an assessment of plant conditions to detect fault conditions and wear by using sensor fusion and machine learning (ML). A fundamental problem is the data quality, which is limited, inter alia, by time synchronization problems. To examine the influence of time synchronization within a distributed sensor system on the prediction performance, a test bed for end-of-line tests, lifetime prediction, and condition monitoring of electromechanical cylinders is considered. The test bed drives the cylinder in a periodic cycle at maximum load; a 1 s period at constant drive speed is used to predict the remaining useful lifetime (RUL). The various sensors for vibration, force, etc. integrated into the test bed are sampled at rates between 10 kHz and 1 MHz. The sensor data are used to train a classification ML model to predict the RUL with a resolution of 1 % based on feature extraction, feature selection, and linear discriminant analysis (LDA) projection. In this contribution, artificial time shifts of up to 50 ms between individual sensors' cycles are introduced, and their influence on the performance of the RUL prediction is investigated. While the ML model achieves good results if no time shifts are introduced, we observed that applying the model trained with unmodified data to data sets with time shifts results in very poor performance of the RUL prediction, even for small time shifts of 0.1 ms. To achieve acceptable performance for time-shifted data as well, and thus obtain a more robust model for application, different approaches were investigated. One approach is based on a modified feature extraction that excludes the phase values after the Fourier transformation; a second is based on extending the training data set by including artificially time-shifted data. The latter approach is thus similar to the data augmentation used to improve the training of neural networks.
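
The augmentation approach can be sketched as below: training cycles are duplicated with small artificial time shifts (implemented as sample shifts) so that the trained model becomes more tolerant of synchronization errors. The signal shapes, shift values, the simple spectral features, and the LDA classifier are assumptions standing in for the full feature extraction and selection pipeline.

```python
# Minimal sketch (illustrative signals): extend the training set with artificially
# time-shifted copies of each cycle so the model tolerates synchronization errors.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(7)
fs = 10_000                                  # assumed 10 kHz sampling rate, 1 s cycles
cycles = rng.normal(size=(200, fs))          # placeholder sensor cycles (training data)
rul_class = rng.integers(0, 100, 200)        # RUL class labels with 1 % resolution

def augment_with_shifts(x, y, shifts_ms):
    """Duplicate the cycles with small sample shifts emulating time-sync errors."""
    xs, ys = [x], [y]
    for ms in shifts_ms:
        xs.append(np.roll(x, int(ms * fs / 1000), axis=1))
        ys.append(y)
    return np.vstack(xs), np.concatenate(ys)

X_aug, y_aug = augment_with_shifts(cycles, rul_class, shifts_ms=[0.1, 1.0, 10.0])

# Fourier coefficients including phase (shift-sensitive) stand in for the feature
# extraction/selection steps before the LDA projection and classification.
spectrum = np.fft.rfft(X_aug, axis=1)[:, :25]
features = np.hstack([spectrum.real, spectrum.imag])
model = LinearDiscriminantAnalysis().fit(features, y_aug)
```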


2008 ◽  
Vol 34 (1) ◽  
pp. 127-134 ◽  
Author(s):  
Natalie Cvijanovich ◽  
Thomas P. Shanley ◽  
Richard Lin ◽  
Geoffrey L. Allen ◽  
Neal J. Thomas ◽  
...  

We previously generated genome-wide expression data (microarray) from children with septic shock, data with the potential to lead the field into novel areas of investigation. Herein we seek to validate our data through a bioinformatic approach centered on a validation patient cohort. Forty-two children with a clinical diagnosis of septic shock and 15 normal controls served as the training data set, while 30 separate children with septic shock and 14 separate normal controls served as the test data set. Class prediction modeling using the training data set and the previously reported genome-wide expression signature of pediatric septic shock correctly identified 95–100% of controls and septic shock patients in the test data set, depending on the class prediction algorithm and the gene selection method. Subjecting the test data set to an identical filtering strategy as that used for the training data set demonstrated 75% concordance between the two gene lists. Subjecting the test data set to a purely statistical filtering strategy, with highly stringent correction for multiple comparisons, demonstrated <50% concordance with the previous gene filtering strategy. However, functional analysis of this statistics-based gene list demonstrated functional annotations and signaling pathways similar to those seen in the training data set. In particular, we validated that pediatric septic shock is characterized by large-scale repression of genes related to zinc homeostasis and lymphocyte function. These data demonstrate that the previously reported genome-wide expression signature of pediatric septic shock is applicable to a validation cohort of patients.
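
The statistics-based filtering and class-prediction steps can be sketched as below: genes are filtered by a per-probe t-test with a stringent (Bonferroni) correction for multiple comparisons, and a simple classifier trained on the training cohort predicts class labels for the test cohort. The expression matrices, the injected synthetic signature, and the choice of k-nearest-neighbor classifier are illustrative assumptions, not the authors' pipeline.

```python
# Minimal sketch (synthetic data): stringent statistical gene filtering on the
# training cohort followed by class prediction for the test cohort.
import numpy as np
from scipy.stats import ttest_ind
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(8)
expr_train = rng.normal(size=(57, 10000))   # 42 septic shock + 15 controls, 10k probes
expr_train[:42, :50] += 2.0                 # inject a synthetic "signature" for illustration
labels_train = np.array([1] * 42 + [0] * 15)
expr_test = rng.normal(size=(44, 10000))    # 30 septic shock + 14 controls
expr_test[:30, :50] += 2.0

# Per-probe t-test with a highly stringent (Bonferroni) multiple-comparison correction.
_, p = ttest_ind(expr_train[labels_train == 1], expr_train[labels_train == 0])
selected = p < 0.05 / p.size

# Class prediction model trained on the filtered training data.
clf = KNeighborsClassifier(n_neighbors=3).fit(expr_train[:, selected], labels_train)
predicted = clf.predict(expr_test[:, selected])
```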

