SMDI: An Index for Measuring Subgingival Microbial Dysbiosis

2021 ◽  
pp. 002203452110357
Author(s):  
T. Chen ◽  
P.D. Marsh ◽  
N.N. Al-Hebshi

An intuitive, clinically relevant index of microbial dysbiosis as a summary statistic of subgingival microbiome profiles is needed. Here, we describe a subgingival microbial dysbiosis index (SMDI) based on machine learning analysis of published periodontitis/health 16S microbiome data. The raw sequencing data, split into training and test sets, were quality filtered, taxonomically assigned to the species level, and centered log-ratio transformed. The training data set was subjected to random forest analysis to identify discriminating species (DS) between periodontitis and health. DS lists compiled at various Gini importance score cutoffs were used to compute the SMDI for samples in the training and test data sets as the mean centered log-ratio abundance of periodontitis-associated species minus that of health-associated species. Diagnostic accuracy was assessed with receiver operating characteristic analysis. An SMDI based on 49 DS provided the highest accuracy, with areas under the curve of 0.96 and 0.92 in the training and test data sets, respectively; it ranged from −6 (most normobiotic) to 5 (most dysbiotic), with a value around zero discriminating most of the periodontitis and healthy samples. The top periodontitis-associated DS were Treponema denticola, Mogibacterium timidum, Fretibacterium spp., and Tannerella forsythia, while Actinomyces naeslundii and Streptococcus sanguinis were the top health-associated DS. The index was highly reproducible across hypervariable regions. Applying the index to additional test data sets in which nitrate had been used to modulate the microbiome demonstrated that nitrate has dysbiosis-lowering properties in vitro and in vivo. Finally, 3 genera (Treponema, Fretibacterium, and Actinomyces) were identified that could be used to calculate a simplified SMDI with comparable accuracy. In conclusion, we have developed a nonbiased, reproducible, and easy-to-interpret index that can be used to identify patients/sites at risk of periodontitis, to assess the microbial response to treatment, and, importantly, as a quantitative tool in microbiome modulation studies.
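
Below is a minimal sketch of how an SMDI-style score could be computed from a samples-by-species count table, assuming the index is the mean centered log-ratio (CLR) abundance of disease-associated species minus that of health-associated species, as the abstract describes. The species lists, pseudocount, and toy counts are illustrative assumptions, not the published 49-species panel.

```python
# Sketch of an SMDI-style dysbiosis score from a samples-by-species count table.
import numpy as np
import pandas as pd

def clr(counts: pd.DataFrame, pseudocount: float = 0.5) -> pd.DataFrame:
    """Centered log-ratio transform each sample (row) of a count table."""
    log_x = np.log(counts + pseudocount)
    return log_x.sub(log_x.mean(axis=1), axis=0)

def dysbiosis_index(counts: pd.DataFrame,
                    disease_species: list[str],
                    health_species: list[str]) -> pd.Series:
    """Mean CLR abundance of disease-associated species minus that of health-associated ones."""
    clr_abund = clr(counts)
    return clr_abund[disease_species].mean(axis=1) - clr_abund[health_species].mean(axis=1)

# Hypothetical usage with a toy table of read counts per species
counts = pd.DataFrame(
    {"Treponema denticola": [120, 3], "Tannerella forsythia": [80, 1],
     "Actinomyces naeslundii": [5, 400], "Streptococcus sanguinis": [2, 350]},
    index=["periodontitis_site", "healthy_site"],
)
smdi = dysbiosis_index(counts,
                       disease_species=["Treponema denticola", "Tannerella forsythia"],
                       health_species=["Actinomyces naeslundii", "Streptococcus sanguinis"])
print(smdi)  # higher values indicate a more dysbiotic profile
```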

2021 ◽  
Author(s):  
Louise Bloch ◽  
Christoph M. Friedrich

Abstract Background: Predicting whether subjects with Mild Cognitive Impairment (MCI) will prospectively develop Alzheimer's Disease (AD) is important for the recruitment and monitoring of subjects for therapy studies. Machine Learning (ML) is suitable to improve early AD prediction. The etiology of AD is heterogeneous, which leads to noisy data sets. Additional noise is introduced by multicentric study designs and varying acquisition protocols. This article examines whether an automatic and fair data valuation method based on Shapley values can identify subjects with noisy data. Methods: An ML workflow was developed and trained for a subset of the Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort. The validation was executed for an independent ADNI test data set and for the Australian Imaging, Biomarker and Lifestyle Flagship Study of Ageing (AIBL) cohort. The workflow included volumetric Magnetic Resonance Imaging (MRI) feature extraction, subject sample selection using data Shapley, Random Forest (RF) and eXtreme Gradient Boosting (XGBoost) for model training, and Kernel SHapley Additive exPlanations (SHAP) values for model interpretation. This model interpretation enables clinically relevant explanation of individual predictions. Results: The XGBoost models that excluded 116 of the 467 subjects from the training data set based on their Logistic Regression (LR) data Shapley values outperformed the models trained on the entire training data set, which reached a mean classification accuracy of 58.54%, by 14.13% (8.27 percentage points) on the independent ADNI test data set. The XGBoost models trained on the entire training data set reached a mean accuracy of 60.35% for the AIBL data set. An improvement of 24.86% (15.00 percentage points) could be reached for the XGBoost models if the 72 subjects with the smallest RF data Shapley values were excluded from the training data set. Conclusion: The data Shapley method was able to improve the classification accuracies for the test data sets. Noisy data were associated with the number of ApoE ε4 alleles and volumetric MRI measurements. Kernel SHAP showed that the black-box models learned biologically plausible associations.
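
The following is an illustrative sketch of the sample-exclusion step only: given per-subject data valuation scores (in the study, data Shapley values computed with an LR or RF valuation model), the lowest-valued training subjects are dropped before refitting the classifier. The synthetic data, placeholder scores, and variable names are assumptions, not the ADNI/AIBL pipeline.

```python
# Exclude low-valued training subjects (by a data valuation score) and refit.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=467, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Placeholder valuation scores; in the study these would be data Shapley values.
data_values = np.random.RandomState(0).normal(size=len(y_train))

baseline = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("baseline accuracy:", baseline.score(X_test, y_test))

# Drop the k lowest-valued subjects from the training set and retrain.
k = 116
keep = np.argsort(data_values)[k:]
filtered = RandomForestClassifier(random_state=0).fit(X_train[keep], y_train[keep])
print("filtered accuracy:", filtered.score(X_test, y_test))
```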


1991 ◽  
Vol 9 (5) ◽  
pp. 871-876 ◽  
Author(s):  
M J Ratain ◽  
J Robert ◽  
W J van der Vijgh

Although doxorubicin is one of the most commonly used antineoplastics, no studies to date have clearly related the area under the concentration-time curve (AUC) to toxicity or response. The limited sampling model has recently been shown to be a feasible method for estimating the AUC to facilitate pharmacodynamic studies. Data from two previous studies of doxorubicin pharmacokinetics were used, including 26 patients with sarcoma and five patients with breast cancer or unknown primary. The former were divided into a training data set of 15 patients and a test data set of 11 patients, and the latter patients formed a second test data set. The model was developed by stepwise multiple regression on the training data set: AUC (ng·h/mL) = 17.39 C2 + 163 C48 − 111.0 [dose/(50 mg/m²)], where C2 and C48 are the concentrations at 2 and 48 hours after the bolus dose. The model was subsequently validated on both test data sets: first test data set, mean predictive error (MPE) 4.7% and root mean square error (RMSE) 12.4%; second test data set, MPE 4.5% and RMSE 9.2%. An additional model was also generated using a simulated time point to estimate the total AUC for a daily × 3-day schedule: AUC (ng·h/mL) = 44.79 C2 + 175.65 C48 + 47.25 [dose/(25 mg/m²/d)], where C48 is obtained just prior to the third dose. We conclude that the AUC of doxorubicin after bolus administration can be adequately estimated from two timed plasma concentrations.
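
As a sketch, the two reported regression equations translate directly into small helper functions; the coefficients are taken from the abstract, while the example input values are purely hypothetical.

```python
# Limited-sampling estimates of doxorubicin AUC from two timed plasma concentrations.
# Concentrations in ng/mL, dose in mg/m^2, returned AUC in ng*h/mL.
def doxorubicin_auc_bolus(c2: float, c48: float, dose_mg_m2: float) -> float:
    """Single-bolus AUC from the 2-h and 48-h concentrations."""
    return 17.39 * c2 + 163.0 * c48 - 111.0 * (dose_mg_m2 / 50.0)

def doxorubicin_auc_daily_x3(c2: float, c48: float, daily_dose_mg_m2: float) -> float:
    """Total AUC for a daily x 3-day schedule (C48 drawn just before the third dose)."""
    return 44.79 * c2 + 175.65 * c48 + 47.25 * (daily_dose_mg_m2 / 25.0)

# Hypothetical values purely to show the call pattern
print(doxorubicin_auc_bolus(c2=30.0, c48=8.0, dose_mg_m2=50.0))
```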


2020 ◽  
Vol 10 (1) ◽  
pp. 55-60
Author(s):  
Owais Mujtaba Khanday ◽  
Samad Dadvandipour

Deep Neural Networks (DNNs) have in the past few years revolutionized computer vision by providing the best results on a large number of problems such as image classification, pattern recognition, and speech recognition. One of the essential models in deep learning used for image classification is the convolutional neural network. These networks integrate a number of features, or so-called filters, in a multi-layer fashion through convolutional layers. These models use convolutional and pooling layers for feature abstraction and have neurons arranged in three dimensions: height, width, and depth. Filters of three different sizes were used: 3×3, 5×5, and 7×7. The accuracy on the training data decreased from 100% to 97.8% as the filter size increased, and the accuracy on the test data set also decreased: 98.7% for 3×3, 98.5% for 5×5, and 97.8% for 7×7. The loss on the training data and test data per 10 epochs increased drastically from 3.4% to 27.6% and from 12.5% to 23.02%, respectively. Thus, filters with smaller dimensions give lower loss than those with larger dimensions. However, using the smaller filter size comes at the cost of computational complexity, which is crucial in the case of larger data sets.
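
A minimal Keras sketch of this kind of comparison is shown below: identical CNNs that differ only in convolutional kernel size (3×3, 5×5, 7×7). The input shape, network depth, and training settings are assumptions for illustration, not the authors' exact setup.

```python
# Build otherwise-identical CNNs with different convolutional kernel sizes.
from tensorflow.keras import layers, models

def build_cnn(kernel_size: int, input_shape=(28, 28, 1), num_classes=10):
    """Two conv+pool blocks followed by a dense softmax classifier."""
    return models.Sequential([
        layers.Conv2D(32, (kernel_size, kernel_size), activation="relu",
                      padding="same", input_shape=input_shape),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (kernel_size, kernel_size), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(num_classes, activation="softmax"),
    ])

for k in (3, 5, 7):
    model = build_cnn(kernel_size=k)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    # model.fit(x_train, y_train, epochs=10, validation_data=(x_test, y_test))
    print(f"{k}x{k} kernels -> {model.count_params():,} parameters")
```

Larger kernels add parameters and multiply-accumulate operations per layer, which is the computational cost the abstract refers to.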


2021 ◽  
Author(s):  
Hye-Won Hwang ◽  
Jun-Ho Moon ◽  
Min-Gyu Kim ◽  
Richard E. Donatelli ◽  
Shin-Jae Lee

ABSTRACT Objectives To compare an automated cephalometric analysis based on the latest deep learning method of automatically identifying cephalometric landmarks (AI) with previously published AI, according to the test style of the worldwide AI challenges at the International Symposium on Biomedical Imaging conferences held by the Institute of Electrical and Electronics Engineers (IEEE ISBI). Materials and Methods This latest AI was developed by using a total of 1983 cephalograms as training data. In the training procedures, a modification of a contemporary deep learning method, the YOLO version 3 algorithm, was applied. Test data consisted of 200 cephalograms. To follow the same test style as the AI challenges at IEEE ISBI, a human examiner manually identified the 19 IEEE ISBI-designated cephalometric landmarks in both the training and test data sets, and these were used as references for comparison. Then, the latest AI and another human examiner independently detected the same landmarks in the test data set. The test results were compared by the measures used at IEEE ISBI: the success detection rate (SDR) and the success classification rate (SCR). Results The SDR of the latest AI in the 2-mm range was 75.5% and the SCR was 81.5%. These were greater than those of any previous AI. Compared to the human examiners, the AI showed a superior success classification rate in some cephalometric analysis measures. Conclusions This latest AI seems to have superior performance compared to previous AI methods. It also seems to demonstrate cephalometric analysis comparable to human examiners.
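
A small sketch of the SDR metric follows: the fraction of predicted landmarks falling within a distance threshold (for example, 2 mm) of the reference annotations. The array shapes and the synthetic coordinates are assumptions; the challenge's exact evaluation code is not reproduced here.

```python
# Success detection rate (SDR) within a millimeter threshold.
import numpy as np

def success_detection_rate(pred_mm: np.ndarray, ref_mm: np.ndarray,
                           threshold_mm: float = 2.0) -> float:
    """pred_mm, ref_mm: arrays of shape (n_images, n_landmarks, 2) in millimeters."""
    errors = np.linalg.norm(pred_mm - ref_mm, axis=-1)   # radial error per landmark
    return float((errors <= threshold_mm).mean())

# Hypothetical usage with 200 test cephalograms and 19 landmarks
rng = np.random.default_rng(0)
ref = rng.uniform(0, 200, size=(200, 19, 2))
pred = ref + rng.normal(scale=1.5, size=ref.shape)
print(f"SDR (2 mm): {success_detection_rate(pred, ref):.1%}")
```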


2021 ◽  
Author(s):  
David Cotton ◽  

Introduction

HYDROCOASTAL is a two-year project funded by ESA, with the objective to maximise exploitation of SAR and SARin altimeter measurements in the coastal zone and inland waters, by evaluating and implementing new approaches to process SAR and SARin data from CryoSat-2, and SAR altimeter data from Sentinel-3A and Sentinel-3B. Optical data from Sentinel-2 MSI and Sentinel-3 OLCI instruments will also be used in generating River Discharge products.

New SAR and SARin processing algorithms for the coastal zone and inland waters will be developed, implemented, and evaluated through an initial Test Data Set for selected regions. From the results of this evaluation, a processing scheme will be implemented to generate global coastal zone and river discharge data sets.

A series of case studies will assess these products in terms of their scientific impacts.

All the produced data sets will be available on request to external researchers, and full descriptions of the processing algorithms will be provided.

Objectives

The scientific objectives of HYDROCOASTAL are to enhance our understanding of interactions between inland waters and the coastal zone, between the coastal zone and the open ocean, and of the small-scale processes that govern these interactions. The project also aims to improve our capability to characterize the variation, at different time scales, of inland water storage, exchanges with the ocean, and the impact on regional sea-level changes.

The technical objectives are to develop and evaluate new SAR and SARin altimetry processing techniques in support of the scientific objectives, including stack processing, filtering, and retracking. An improved Wet Troposphere Correction will also be developed and evaluated.

Project Outline

There are four tasks to the project:

- Scientific Review and Requirements Consolidation: Review the current state of the art in SAR and SARin altimeter data processing as applied to the coastal zone and to inland waters.
- Implementation and Validation: New processing algorithms will be implemented to generate a Test Data Set, which will be validated against models, in-situ data, and other satellite data sets. Selected algorithms will then be used to generate global coastal zone and river discharge data sets.
- Impacts Assessment: The impact of these global products will be assessed in a series of Case Studies.
- Outreach and Roadmap: Outreach material will be prepared and distributed to engage with the wider scientific community, and recommendations will be provided for the development of future missions and future research.

Presentation

The presentation will provide an overview of the project, present the different SAR altimeter processing algorithms that are being evaluated in the first phase of the project, and show early results from the evaluation of the initial test data set.


2021 ◽  
Vol 10 (1) ◽  
pp. 105
Author(s):  
I Gusti Ayu Purnami Indryaswari ◽  
Ida Bagus Made Mahendra

Many Indonesian people, especially in Bali, raise pigs as livestock. Pigs are susceptible to various types of diseases, and there have been many cases of pig deaths due to disease, causing losses to breeders. Therefore, the authors wanted to create an Android-based application that can predict the type of disease in pigs by applying the C4.5 algorithm. The C4.5 algorithm is an algorithm for classifying data in order to obtain rules that can be used to make predictions. In this study, 50 training data sets were used, covering 8 types of diseases in pigs and 31 disease symptoms. These were input into the system and processed so that the system, in the form of an Android application, could predict the type of disease in pigs. In the testing process, 15 test data sets were evaluated, producing an accuracy of 86.7%. In testing the application features, built using the Kotlin programming language and the SQLite database, the application ran as expected.
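
As a hedged sketch of the rule-learning step: scikit-learn does not implement C4.5 itself, but a decision tree trained with the entropy criterion illustrates the same symptom-to-disease rule extraction on a small symptom matrix. The symptom and disease columns below are invented, not the study's 31 symptoms and 8 diseases.

```python
# Entropy-based decision tree as a stand-in for C4.5 rule induction.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy training table: rows are cases, columns are binary symptoms, target is the disease.
train = pd.DataFrame({
    "fever":        [1, 1, 0, 0, 1],
    "diarrhea":     [1, 0, 1, 0, 0],
    "skin_lesions": [0, 1, 0, 1, 0],
    "disease":      ["hog_cholera", "swine_erysipelas", "colibacillosis",
                     "swine_pox", "hog_cholera"],
})

X, y = train.drop(columns="disease"), train["disease"]
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))  # human-readable rules

# Test accuracy is correct predictions / total cases; 13 of 15 correct gives 86.7%.
print(f"{13 / 15:.1%}")
```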


Author(s):  
Yanxiang Yu ◽  
◽  
Chicheng Xu ◽  
Siddharth Misra ◽  
Weichang Li ◽  
...  

Compressional and shear sonic traveltime logs (DTC and DTS, respectively) are crucial for subsurface characterization and seismic-well tie. However, these two logs are often missing or incomplete in many oil and gas wells. Therefore, many petrophysical and geophysical workflows include sonic log synthetization or pseudo-log generation based on multivariate regression or rock physics relations. The SPWLA PDDA SIG hosted a contest, which started on March 1, 2020, and concluded on May 7, 2020, aiming to predict the DTC and DTS logs from seven "easy-to-acquire" conventional logs using machine-learning methods (GitHub, 2020). In the contest, a total of 20,525 data points with half-foot resolution from three wells was collected to train regression models using machine-learning techniques. Each data point had seven features, consisting of the conventional "easy-to-acquire" logs: caliper, neutron porosity, gamma ray (GR), deep resistivity, medium resistivity, photoelectric factor, and bulk density, as well as two sonic logs (DTC and DTS) as the target. A separate data set of 11,089 samples from a fourth well was then used as the blind test data set. The prediction performance of the model was evaluated using root mean square error (RMSE) as the metric: RMSE = sqrt( (1/(2m)) Σ_{i=1}^{m} [ (DTC_pred^i − DTC_true^i)² + (DTS_pred^i − DTS_true^i)² ] ). In the benchmark model (Yu et al., 2020), we used a Random Forest regressor and conducted minimal preprocessing of the training data set; an RMSE score of 17.93 was achieved on the test data set. The top five models from the contest, on average, beat the performance of our benchmark model by 27% in the RMSE score. In this paper, we review these five solutions, including preprocessing techniques and different machine-learning models, including neural networks, long short-term memory (LSTM), and ensemble trees. We found that data cleaning and clustering were critical for improving the performance of all models.
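
A short sketch of the contest metric as reconstructed above, a joint RMSE over the two predicted logs, follows; the array names and example values are assumptions.

```python
# Joint RMSE over the DTC and DTS predictions.
import numpy as np

def joint_rmse(dtc_pred, dtc_true, dts_pred, dts_true) -> float:
    """sqrt of the mean squared error pooled over both target logs."""
    dtc_pred, dtc_true = np.asarray(dtc_pred), np.asarray(dtc_true)
    dts_pred, dts_true = np.asarray(dts_pred), np.asarray(dts_true)
    m = len(dtc_true)
    squared = (dtc_pred - dtc_true) ** 2 + (dts_pred - dts_true) ** 2
    return float(np.sqrt(squared.sum() / (2 * m)))

# Hypothetical usage with two samples
print(joint_rmse([60.0, 65.0], [62.0, 64.0], [100.0, 110.0], [98.0, 111.0]))
```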


2021 ◽  
Vol 15 (Supplement_1) ◽  
pp. S129-S129
Author(s):  
H Abbas Egbariya ◽  
T Braun ◽  
R Hadar ◽  
O Gal-Mor ◽  
N Shental ◽  
...  

Abstract Background Microbial dysbiosis is widely described in inflammatory bowel disease (IBD) and has been shown to predict IBD state. However, many other diseases, including neuropsychiatric conditions, metabolic diseases, and malignancies, most of which do not result in gut inflammation, are also linked with gut microbial alteration. Since most studies focus on a single disease, the extent of similarity between different diseases is usually not examined. Methods We reanalyzed raw sequencing data from 12,838 human gut V4 16S rRNA sequencing samples, spanning 59 case-control comparisons and 28 unique diseases. A novel statistical approach was applied to reduce the effect of the different cohorts; all samples were processed uniformly, and differentially abundant amplicon sequence variants (ASVs) were identified within each cohort. The resulting behavior (direction of change and effect size) of each ASV was then combined across all studies. We used random forest as our classifier and generated a non-specific dysbiosis index (NSDI). Results For disease prediction, each cohort was randomly subsampled to 23 healthy and 23 disease samples. A random forest classifier was trained on one disease/control cohort, and the trained classifier was then used to predict the status of a different disease/control cohort. Disease classifiers performed well in identifying many sick vs. healthy states but failed to differentiate between different diseases. For example, a classifier trained on an IBD cohort also classified disease/control status relatively well in lupus, schizophrenia, or Parkinson's disease cohorts. We show this cross-identification is due to a large number of shared disease-associated bacteria and utilize these bacteria to define a novel non-specific dysbiosis index (NSDI). After identifying 114 non-disease-specific ASVs (86 upregulated and 28 downregulated across diseases in comparison to controls), we calculated the per-sample NSDI by rank-transforming the bacteria within the sample and computing the normalized log ratio of the summed ranks of the upregulated and downregulated ASVs. The resulting NSDI is shown to perform better than the previously published CD dysbiosis index (Gevers et al., 2014; PMID: 24629344), indicating that NSDI can successfully differentiate between most cases and controls across a wide variety of diseases. Conclusion A robust non-specific general response of the gut microbiome is detected across different diseases, some of which is shared with IBD. Classifiers trained on a single disease may identify this general non-specific signal, and therefore care should be taken when interpreting classifier predictions. Finally, our NSDI can be used to prioritize the per-sample degree of dysbiosis.
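
An illustrative sketch of an NSDI-style score follows, assuming a per-sample abundance vector and two predefined ASV sets (up- and downregulated across diseases): rank-transform the ASVs within the sample, then take a normalized log ratio of the summed ranks of the two sets. The set contents and the exact normalization are assumptions, not the authors' published formula.

```python
# NSDI-style per-sample dysbiosis score from within-sample ranks.
import numpy as np
import pandas as pd

def nsdi(sample_abundance: pd.Series, up_asvs: list[str], down_asvs: list[str]) -> float:
    ranks = sample_abundance.rank()                       # within-sample rank transform
    up_sum = ranks[ranks.index.isin(up_asvs)].sum()
    down_sum = ranks[ranks.index.isin(down_asvs)].sum()
    # Normalize each sum by its set size so unequal set sizes do not dominate the ratio.
    return float(np.log((up_sum / len(up_asvs)) / (down_sum / len(down_asvs))))

# Hypothetical usage with a toy sample of four ASVs
sample = pd.Series({"ASV1": 120, "ASV2": 5, "ASV3": 40, "ASV4": 300})
print(nsdi(sample, up_asvs=["ASV1", "ASV3"], down_asvs=["ASV2", "ASV4"]))
```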


2021 ◽  
Vol 87 (6) ◽  
pp. 445-455
Author(s):  
Yi Ma ◽  
Zezhong Zheng ◽  
Yutang Ma ◽  
Mingcang Zhu ◽  
Ran Huang ◽  
...  

Many manifold learning algorithms conduct an eigenvector analysis on a data-similarity matrix with a size of N×N, where N is the number of data points. Thus, the memory complexity of the analysis is no less than O(N²). We present in this article an incremental manifold learning approach to handle large hyperspectral data sets for land use identification. In our method, the number of dimensions for the high-dimensional hyperspectral-image data set is obtained with the training data set. A local curvature variation algorithm is utilized to sample a subset of data points as landmarks. Then a manifold skeleton is identified based on the landmarks. Our method is validated on three AVIRIS hyperspectral data sets, outperforming the comparison algorithms with a k-nearest-neighbor classifier and achieving the second-best performance with a support vector machine.
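
Below is a hedged sketch of the landmark idea only: fit the manifold embedding on a small landmark subset to avoid building an N×N similarity matrix, then map the remaining points through the fitted model. Random landmark selection stands in for the authors' local-curvature-variation sampling, Isomap stands in for their specific embedding, and the data are synthetic.

```python
# Landmark-based manifold embedding to avoid an N-by-N similarity matrix.
import numpy as np
from sklearn.manifold import Isomap

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 100))         # stand-in for hyperspectral pixels x bands

n_landmarks = 1000
landmarks = rng.choice(len(X), size=n_landmarks, replace=False)

embedding = Isomap(n_components=10, n_neighbors=12)
embedding.fit(X[landmarks])              # manifold "skeleton" built from landmarks only
X_low = embedding.transform(X)           # embed all pixels via the landmark model
print(X_low.shape)
```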


2021 ◽  
Author(s):  
Xingang Jia ◽  
Qiuhong Han ◽  
Zuhong Lu

Abstract Background: Phages are the most abundant biological entities, but commonly used clustering techniques have difficulty separating them from other virus families and classifying the different phage families. Results: This work uses GI-clusters to separate phages from other virus families and to classify the different phage families. GI-clusters are constructed from GI-features; GI-features are constructed from F-features together with training data and the MG-Euclidean and Icc-cluster algorithms; F-features are the frequencies of multiple nucleotides generated from the genomes of viruses; the MG-Euclidean algorithm puts nearest neighbors into the same mini-groups; and the Icc-cluster algorithm puts distant samples into different mini-clusters. Viruses whose GI-features have their maximum element in the same location are assigned to the same GI-cluster, where the families of viruses in the test data are identified by their GI-clusters, and the families of the GI-clusters are defined by the viruses of the training data. Conclusions: From the analysis of 4 data sets constructed from viruses of different families, we demonstrate that GI-clusters are able to separate phages from other virus families, correctly classify the different phage families, and correctly predict the families of unknown phages.
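
A sketch of the F-feature step as described, the frequencies of short nucleotide words (commonly called k-mers) computed from each viral genome, is shown below. The word length and the toy sequence are assumptions, and the downstream clustering steps (MG-Euclidean, Icc-cluster) are not reproduced here.

```python
# Normalized k-mer frequency features from a DNA sequence.
from collections import Counter
from itertools import product

def kmer_frequencies(sequence: str, k: int = 3) -> dict[str, float]:
    """Return the normalized frequency of every DNA k-mer in the sequence."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values()) or 1
    return {"".join(p): counts.get("".join(p), 0) / total
            for p in product("ACGT", repeat=k)}

# Hypothetical usage on a toy genome fragment
features = kmer_frequencies("ATGCGTACGTTAGC", k=3)
print(len(features), features["CGT"])
```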

