Microbiome meta-analysis and cross-disease comparison enabled by the SIAMCAT machine-learning toolbox

Author(s):  
Jakob Wirbel ◽  
Konrad Zych ◽  
Morgan Essex ◽  
Nicolai Karcher ◽  
Ece Kartal ◽  
...  

Abstract The human microbiome is increasingly mined for diagnostic and therapeutic biomarkers using machine learning (ML). However, metagenomics-specific software is scarce, and overoptimistic evaluation and limited cross-study generalization are prevailing issues. To address these, we developed SIAMCAT, a versatile R toolbox for ML-based comparative metagenomics. We demonstrate its capabilities in a meta-analysis of fecal metagenomic studies (10,803 samples). When naively transferred across studies, ML models lost accuracy and disease specificity, which could however be resolved by a novel training set augmentation strategy. This revealed some biomarkers to be disease-specific, others shared across multiple conditions. SIAMCAT is freely available from siamcat.embl.de.

2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Jakob Wirbel ◽  
Konrad Zych ◽  
Morgan Essex ◽  
Nicolai Karcher ◽  
Ece Kartal ◽  
...  

Abstract The human microbiome is increasingly mined for diagnostic and therapeutic biomarkers using machine learning (ML). However, metagenomics-specific software is scarce, and overoptimistic evaluation and limited cross-study generalization are prevailing issues. To address these, we developed SIAMCAT, a versatile R toolbox for ML-based comparative metagenomics. We demonstrate its capabilities in a meta-analysis of fecal metagenomic studies (10,803 samples). When naively transferred across studies, ML models lost accuracy and disease specificity, which could however be resolved by a novel training set augmentation strategy. This reveals some biomarkers to be disease-specific, with others shared across multiple conditions. SIAMCAT is freely available from siamcat.embl.de.


2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Yaozhong Liu ◽  
Fan Bai ◽  
Zhenwei Tang ◽  
Na Liu ◽  
Qiming Liu

Abstract Background Atrial fibrillation (AF) is the most common arrhythmia, and its mechanisms are poorly understood. We aimed to investigate the biological mechanism of AF and to discover feature genes by analyzing multi-omics data and by applying a machine learning approach. Methods At the transcriptomic level, four microarray datasets (GSE41177, GSE79768, GSE115574, GSE14975) were downloaded from the Gene Expression Omnibus database, which included 130 available atrial samples from AF and sinus rhythm (SR) patients with valvular heart disease. Microarray meta-analysis was adopted to identify differentially expressed genes (DEGs). At the proteomic level, a qualitative and quantitative analysis of proteomics in the left atrial appendage of 18 patients (9 with AF and 9 with SR) who underwent cardiac valvular surgery was conducted. The machine learning correlation-based feature selection (CFS) method was introduced to select feature genes of AF using the training set of 130 samples involved in the microarray meta-analysis. The Naive Bayes (NB) based classifier constructed using the training set was evaluated on an independent validation test set, GSE2240. Results 863 DEGs with FDR < 0.05 and 482 differentially expressed proteins (DEPs) with FDR < 0.1 and fold change > 1.2 were obtained from the transcriptomic and proteomic study, respectively. The DEGs and DEPs were then analyzed together, which identified 30 biomarkers with consistent trends. Further, 10 features, including 8 upregulated genes (CD44, CHGB, FHL2, GGT5, IGFBP2, NRAP, SEPTIN6, YWHAQ) and 2 downregulated genes (TNNI1, TRDN), were selected from the 30 biomarkers through the machine learning CFS method using the training set. The NB based classifier constructed using the training set accurately and reliably classified AF from SR samples in the validation test set, with a precision of 87.5% and an AUC of 0.995.
Conclusion Taken together, our present work might provide novel insights into the molecular mechanism and provide some promising diagnostic and therapeutic targets of AF.
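
The two evaluation metrics quoted in the Results above (precision and AUC) can be illustrated with a minimal sketch. It is not the authors' code and does not reproduce their classifier or data: the labels and scores below are made up, the function names `precision` and `roc_auc` are hypothetical, and the AUC is computed with the standard pairwise (Mann-Whitney) formulation.

```python
def precision(y_true, y_pred):
    """Fraction of predicted positives that are true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fp) if (tp + fp) else 0.0


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: 1 = AF, 0 = SR; scores are hypothetical classifier outputs.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # illustrative 0.5 threshold

print(precision(y_true, y_pred))
print(roc_auc(y_true, scores))
```

Precision depends on the chosen decision threshold, whereas AUC summarizes ranking quality across all thresholds, which is why the two numbers in an abstract like this one can differ substantially.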


2020 ◽  
Author(s):  
Yaozhong Liu ◽  
Fan Bai ◽  
Zhenwei Tang ◽  
Na Liu ◽  
Qiming Liu

Abstract Background: Atrial fibrillation (AF) is the most common arrhythmia, and its mechanisms are poorly understood. We aimed to investigate the biological mechanism of AF and to discover feature genes by analyzing multi-omics data and by applying a machine learning approach. Methods: At the transcriptomic level, four microarray datasets (GSE41177, GSE79768, GSE115574, GSE14975) were downloaded from the Gene Expression Omnibus database, which included 130 available atrial samples from the AF and sinus rhythm (SR) groups. Microarray meta-analysis was adopted to identify differentially expressed genes (DEGs). At the proteomic level, a qualitative and quantitative analysis of proteomics in the left atrial appendage of 18 patients (9 with AF and 9 with SR) was conducted. The machine learning correlation-based feature selection (CFS) method was introduced to select feature genes of AF using the training set of 130 samples involved in the microarray meta-analysis. The Naive Bayes (NB) based classifier constructed using the training set was evaluated on an independent validation test set, GSE2240. Results: 863 DEGs with an FDR < 0.05 and 482 differentially expressed proteins (DEPs) with an FDR < 0.1 and fold change > 1.2 were obtained from the transcriptomic and proteomic study, respectively. The DEGs and DEPs were then analyzed together, which identified 30 biomarkers with consistent trends. Further, 10 features, including 8 upregulated genes (CD44, CHGB, FHL2, GGT5, IGFBP2, NRAP, SEPTIN6, YWHAQ) and 2 downregulated genes (TNNT1, TRDN), were selected from the 30 biomarkers through the machine learning CFS method using the training set. The NB based classifier constructed using the training set accurately and reliably classified AF from SR samples in the validation test set, with a precision of 87.5% and an AUC of 0.995. Conclusion: Taken together, our present work might provide novel insights into the molecular mechanism and provide some promising diagnostic and therapeutic targets of AF.


Diabetes ◽  
2020 ◽  
Vol 69 (Supplement 1) ◽  
pp. 389-P
Author(s):  
SATORU KODAMA ◽  
MAYUKO H. YAMADA ◽  
YUTA YAGUCHI ◽  
MASARU KITAZAWA ◽  
MASANORI KANEKO ◽  
...  

2019 ◽  
Author(s):  
Sun Jae Moon ◽  
Jin Seub Hwang ◽  
Rajesh Kana ◽  
John Torous ◽  
Jung Won Kim

BACKGROUND In recent years, machine learning algorithms have been increasingly applied in biomedical fields. In particular, their application has been drawing more attention in the field of psychiatry, for instance, as diagnostic tests/tools for autism spectrum disorder. However, given their complexity and potential clinical implications, there is an ongoing need for further research on their accuracy. OBJECTIVE The current study aims to summarize the evidence for the accuracy of machine learning algorithms in diagnosing autism spectrum disorder (ASD) through systematic review and meta-analysis. METHODS The MEDLINE, Embase, CINAHL Complete (with OpenDissertations), PsycINFO and IEEE Xplore Digital Library databases were searched on November 28th, 2018. Studies that used a machine learning algorithm partially or fully in classifying ASD from controls and provided accuracy measures were included in our analysis. A bivariate random-effects model was applied to the pooled data in the meta-analysis. Subgroup analysis was used to investigate and resolve the source of heterogeneity between studies. True-positive, false-positive, false-negative and true-negative values from individual studies were used to calculate the pooled sensitivity and specificity values, draw SROC curves, and obtain the area under the curve (AUC) and partial AUC (pAUC). RESULTS A total of 43 studies were included in the final analysis, of which meta-analysis was performed on 40 studies (53 samples with 12,128 participants). A structural MRI subgroup meta-analysis (12 samples with 1,776 participants) showed a sensitivity of 0.83 (95% CI: 0.76 to 0.89), a specificity of 0.84 (95% CI: 0.74 to 0.91), and AUC/pAUC of 0.90/0.83. An fMRI/deep neural network (DNN) subgroup meta-analysis (five samples with 1,345 participants) showed a sensitivity of 0.69 (95% CI: 0.62 to 0.75), a specificity of 0.66 (95% CI: 0.61 to 0.70), and AUC/pAUC of 0.71/0.67.
CONCLUSIONS Machine learning algorithms that used structural MRI features in diagnosis of ASD were shown to have accuracy that is similar to currently used diagnostic tools.
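
The sensitivity and specificity figures above derive from per-study true-positive, false-positive, false-negative and true-negative counts. The sketch below shows only the basic arithmetic under a simplifying assumption: counts are summed across studies (naive pooling), which is not the bivariate random-effects model the abstract describes. The counts are made up and the function names are hypothetical.

```python
def sensitivity(tp, fn):
    """True-positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True-negative rate: TN / (TN + FP)."""
    return tn / (tn + fp)


def naive_pool(studies):
    """Sum 2x2 counts over studies given as (tp, fp, fn, tn) tuples.

    This ignores between-study heterogeneity; a bivariate random-effects
    model would instead model logit sensitivity and specificity jointly.
    """
    tp = sum(s[0] for s in studies)
    fp = sum(s[1] for s in studies)
    fn = sum(s[2] for s in studies)
    tn = sum(s[3] for s in studies)
    return sensitivity(tp, fn), specificity(tn, fp)


# Made-up counts for two hypothetical studies: (tp, fp, fn, tn).
studies = [(40, 10, 10, 40), (30, 5, 20, 45)]
sens, spec = naive_pool(studies)
print(f"pooled sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Naive pooling weights studies purely by sample size and treats them as if drawn from one population, so it tends to understate uncertainty. That is one reason meta-analyses of diagnostic accuracy usually prefer a random-effects approach.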


2021 ◽  
Vol 11 (8) ◽  
pp. 3296
Author(s):  
Musarrat Hussain ◽  
Jamil Hussain ◽  
Taqdir Ali ◽  
Syed Imran Ali ◽  
Hafiz Syed Muhammad Bilal ◽  
...  

Clinical Practice Guidelines (CPGs) aim to optimize patient care by assisting physicians during the decision-making process. However, guideline adherence is highly affected by its unstructured format and aggregation of background information with disease-specific information. The objective of our study is to extract disease-specific information from CPGs to enhance their adherence ratio. In this research, we propose a semi-automatic mechanism for extracting disease-specific information from CPGs using pattern-matching techniques. We apply supervised and unsupervised machine-learning algorithms on a CPG to extract a list of salient terms that help distinguish recommendation sentences (RS) from non-recommendation sentences (NRS). Simultaneously, a group of experts analyzes the same CPG and extracts the initial patterns (“heuristic patterns”) using a group decision-making method, the nominal group technique (NGT). We provide the list of salient terms to the experts and ask them to refine their extracted patterns. The experts refine the patterns considering the provided salient terms. The extracted heuristic patterns depend on specific terms and suffer from the specialization problem due to synonymy and polysemy. Therefore, we generalize the heuristic patterns to part-of-speech (POS) patterns and unified medical language system (UMLS) patterns, which makes the proposed method generalizable to all types of CPGs. We evaluated the initial extracted patterns on asthma, rhinosinusitis, and hypertension guidelines, with accuracies of 76.92%, 84.63%, and 89.16%, respectively. The accuracy increased to 78.89%, 85.32%, and 92.07%, respectively, with the refined machine-learning assistive patterns. Our system assists physicians by locating disease-specific information in the CPGs, which enhances the physicians’ performance and reduces CPG processing time. Additionally, it is beneficial for CPG content annotation.


2021 ◽  
pp. 097215092098485
Author(s):  
Sonika Gupta ◽  
Sushil Kumar Mehta

Data mining techniques have proven quite effective not only in detecting financial statement frauds but also in discovering other financial crimes, such as credit card frauds, loan and security frauds, corporate frauds, bank and insurance frauds, etc. Classification of data mining techniques, in recent years, has been accepted as one of the most credible methodologies for the detection of symptoms of financial statement frauds through scanning the published financial statements of companies. The retrieved literature that has used data mining classification techniques can be broadly categorized on the basis of the type of technique applied, as statistical techniques and machine learning techniques. The biggest challenge in executing the classification process using data mining techniques lies in collecting the data sample of fraudulent companies and mapping the sample of fraudulent companies against non-fraudulent companies. In this article, a systematic literature review (SLR) of studies from the area of financial statement fraud detection has been conducted. The review has considered research articles published between 1995 and 2020. Further, a meta-analysis has been performed to establish the effect of data sample mapping of fraudulent companies against non-fraudulent companies on the classification methods through comparing the overall classification accuracy reported in the literature. The retrieved literature indicates that a fraudulent sample can either be equally paired with non-fraudulent sample (1:1 data mapping) or be unequally mapped using 1:many ratio to increase the sample size proportionally. Based on the meta-analysis of the research articles, it can be concluded that machine learning approaches, in comparison to statistical approaches, can achieve better classification accuracy, particularly when the availability of sample data is low. 
High classification accuracy can be obtained with even a 1:1 mapping data set using machine learning classification approaches.
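
The 1:1 and 1:many data mappings discussed above can be sketched as follows. This is a hypothetical illustration, not a procedure taken from the reviewed studies: the function name `map_samples`, the company identifiers and the purely random assignment are all assumptions. Published studies often match on attributes such as industry, size or year, which this sketch omits.

```python
import random


def map_samples(fraud, non_fraud, k=1, seed=0):
    """Assign k distinct non-fraudulent companies to each fraudulent one.

    k=1 gives a 1:1 mapping; k>1 gives a 1:many mapping. Matching here is
    random and without replacement (an assumption made for illustration).
    """
    if len(non_fraud) < k * len(fraud):
        raise ValueError("not enough non-fraudulent companies for this ratio")
    rng = random.Random(seed)  # fixed seed keeps the mapping reproducible
    pool = list(non_fraud)
    rng.shuffle(pool)
    # Slice the shuffled pool into consecutive, non-overlapping groups of k.
    return {f: pool[i * k:(i + 1) * k] for i, f in enumerate(fraud)}


# Hypothetical identifiers for two fraudulent and five non-fraudulent firms.
fraud = ["F1", "F2"]
non_fraud = ["N1", "N2", "N3", "N4", "N5"]

print(map_samples(fraud, non_fraud, k=1))  # 1:1 mapping
print(map_samples(fraud, non_fraud, k=2))  # 1:2 mapping
```

A larger k increases the sample size but also the class imbalance, which is the trade-off the meta-analysis examines when comparing classification accuracy across mappings.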


2021 ◽  
Vol 27 (S1) ◽  
pp. 182-183
Author(s):  
Martin Meier ◽  
Paul Bagot ◽  
Michael Moody ◽  
Daniel Haley

Author(s):  
K Sooknunan ◽  
M Lochner ◽  
Bruce A Bassett ◽  
H V Peiris ◽  
R Fender ◽  
...  

Abstract With the advent of powerful telescopes such as the Square Kilometer Array and the Vera C. Rubin Observatory, we are entering an era of multiwavelength transient astronomy that will lead to a dramatic increase in data volume. Machine learning techniques are well suited to address this data challenge and rapidly classify newly detected transients. We present a multiwavelength classification algorithm consisting of three steps: (1) interpolation and augmentation of the data using Gaussian processes; (2) feature extraction using wavelets; (3) classification with random forests. Augmentation provides improved performance at test time by balancing the classes and adding diversity into the training set. In the first application of machine learning to the classification of real radio transient data, we apply our technique to the Green Bank Interferometer and other radio light curves. We find we are able to accurately classify most of the eleven classes of radio variables and transients after just eight hours of observations, achieving an overall test accuracy of 78%. We fully investigate the impact of the small sample size of 82 publicly available light curves and use data augmentation techniques to mitigate the effect. We also show that on a significantly larger simulated representative training set the algorithm achieves an overall accuracy of 97%, illustrating that the method is likely to provide excellent performance on future surveys. Finally, we demonstrate the effectiveness of simultaneous multiwavelength observations by showing how incorporating just one optical data point into the analysis improves the accuracy of the worst performing class by 19%.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Balamurugan Sadaiappan ◽  
Chinnamani PrasannaKumar ◽  
V. Uthara Nambiar ◽  
Mahendran Subramanian ◽  
Manguesh U. Gauns

Abstract Copepods are the dominant members of the zooplankton community and the most abundant form of life. It is imperative to obtain insights into the copepod-associated bacteriobiomes (CAB) in order to identify specific bacterial taxa associated with a copepod, and to understand how they vary between different copepods. Analysing the potential genes within the CAB may reveal their intrinsic role in biogeochemical cycles. For this, machine-learning models and PICRUSt2 analysis were deployed to analyse 16S rDNA gene sequences (approximately 16 million reads) of CAB belonging to five different copepod genera, viz., Acartia spp., Calanus spp., Centropages sp., Pleuromamma spp., and Temora spp. Overall, we predict 50 sub-OTUs (s-OTUs) (gradient boosting classifiers) to be important in five copepod genera. Among these, 15 s-OTUs were predicted to be important in Calanus spp. and 20 s-OTUs in Pleuromamma spp. Four bacterial s-OTUs, Acinetobacter johnsonii, Phaeobacter, Vibrio shilonii and Piscirickettsiaceae, were identified as important s-OTUs in Calanus spp., and the s-OTUs Marinobacter, Alteromonas, Desulfovibrio, Limnobacter, Sphingomonas, Methyloversatilis, Enhydrobacter and Coriobacteriaceae were predicted as important s-OTUs in Pleuromamma spp., for the first time. Our meta-analysis revealed that the CAB of Pleuromamma spp. had a high proportion of potential genes responsible for methanogenesis and nitrogen fixation, whereas the CAB of Temora spp. had a high proportion of potential genes involved in assimilatory sulphate reduction and cyanocobalamin synthesis. The CAB of Pleuromamma spp. and Temora spp. have potential genes accountable for iron transport.

