Microbiome meta-analysis and cross-disease comparison enabled by the SIAMCAT machine-learning toolbox

AbstractThe human microbiome is increasingly mined for diagnostic and therapeutic biomarkers using machine learning (ML). However, metagenomics-specific software is scarce and overoptimistic evaluation and limited cross-study generalization are prevailing issues. To address these, we developed SIAMCAT, a versatile R toolbox for ML-based comparative metagenomics. We demonstrate its capabilities in a meta-analysis of fecal metagenomic studies (10,803 samples). When naively transferred across studies, ML models lost accuracy and disease specificity, which could however be resolved by a novel training set augmentation strategy. This revealed some biomarkers to be disease-specific, others shared across multiple conditions. SIAMCAT is freely available from siamcat.embl.de.

Download Full-text

Microbiome meta-analysis and cross-disease comparison enabled by the SIAMCAT machine learning toolbox

Genome Biology ◽

10.1186/s13059-021-02306-1 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Jakob Wirbel ◽

Konrad Zych ◽

Morgan Essex ◽

Nicolai Karcher ◽

Ece Kartal ◽

...

Keyword(s):

Machine Learning ◽

Meta Analysis ◽

Human Microbiome ◽

Training Set ◽

Comparative Metagenomics ◽

Link Type ◽

Augmentation Strategy ◽

Disease Specificity ◽

Disease Specific ◽

Multiple Conditions

AbstractThe human microbiome is increasingly mined for diagnostic and therapeutic biomarkers using machine learning (ML). However, metagenomics-specific software is scarce, and overoptimistic evaluation and limited cross-study generalization are prevailing issues. To address these, we developed SIAMCAT, a versatile R toolbox for ML-based comparative metagenomics. We demonstrate its capabilities in a meta-analysis of fecal metagenomic studies (10,803 samples). When naively transferred across studies, ML models lost accuracy and disease specificity, which could however be resolved by a novel training set augmentation strategy. This reveals some biomarkers to be disease-specific, with others shared across multiple conditions. SIAMCAT is freely available from siamcat.embl.de.

Download Full-text

Integrative transcriptomic, proteomic, and machine learning approach to identifying feature genes of atrial fibrillation using atrial samples from patients with valvular heart disease

BMC Cardiovascular Disorders ◽

10.1186/s12872-020-01819-0 ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Yaozhong Liu ◽

Fan Bai ◽

Zhenwei Tang ◽

Na Liu ◽

Qiming Liu

Keyword(s):

Machine Learning ◽

Atrial Fibrillation ◽

Heart Disease ◽

Valvular Heart Disease ◽

Meta Analysis ◽

Learning Approach ◽

Training Set ◽

Test Set ◽

Validation Test ◽

Machine Learning Approach

Abstract Background Atrial fibrillation (AF) is the most common arrhythmia with poorly understood mechanisms. We aimed to investigate the biological mechanism of AF and to discover feature genes by analyzing multi-omics data and by applying a machine learning approach. Methods At the transcriptomic level, four microarray datasets (GSE41177, GSE79768, GSE115574, GSE14975) were downloaded from the Gene Expression Omnibus database, which included 130 available atrial samples from AF and sinus rhythm (SR) patients with valvular heart disease. Microarray meta-analysis was adopted to identified differentially expressed genes (DEGs). At the proteomic level, a qualitative and quantitative analysis of proteomics in the left atrial appendage of 18 patients (9 with AF and 9 with SR) who underwent cardiac valvular surgery was conducted. The machine learning correlation-based feature selection (CFS) method was introduced to selected feature genes of AF using the training set of 130 samples involved in the microarray meta-analysis. The Naive Bayes (NB) based classifier constructed using training set was evaluated on an independent validation test set GSE2240. Results 863 DEGs with FDR < 0.05 and 482 differentially expressed proteins (DEPs) with FDR < 0.1 and fold change > 1.2 were obtained from the transcriptomic and proteomic study, respectively. The DEGs and DEPs were then analyzed together which identified 30 biomarkers with consistent trends. Further, 10 features, including 8 upregulated genes (CD44, CHGB, FHL2, GGT5, IGFBP2, NRAP, SEPTIN6, YWHAQ) and 2 downregulated genes (TNNI1, TRDN) were selected from the 30 biomarkers through machine learning CFS method using training set. The NB based classifier constructed using the training set accurately and reliably classify AF from SR samples in the validation test set with a precision of 87.5% and AUC of 0.995. Conclusion Taken together, our present work might provide novel insights into the molecular mechanism and provide some promising diagnostic and therapeutic targets of AF.

Download Full-text

Integrative transcriptomic, proteomic, and machine learning approach to identifying feature genes of atrial fibrillation

10.21203/rs.3.rs-104305/v1 ◽

2020 ◽

Author(s):

Yaozhong Liu ◽

Fan Bai ◽

Zhenwei Tang ◽

Na Liu ◽

Qiming Liu

Keyword(s):

Machine Learning ◽

Atrial Fibrillation ◽

Meta Analysis ◽

Gene Expression Omnibus ◽

Differentially Expressed ◽

Learning Approach ◽

Training Set ◽

Test Set ◽

Validation Test ◽

Machine Learning Approach

Abstract Background: Atrial fibrillation (AF) is the most common arrhythmia with poorly understood mechanisms. We aimed to investigate the biological mechanism of AF and to discover feature genes by analyzing multi-omics data and by applying a machine learning approach. Methods: At the transcriptomic level, four microarray datasets (GSE41177, GSE79768, GSE115574, GSE14975) were downloaded from the Gene Expression Omnibus database, which included 130 available atrial samples from AF and sinus rhythm (SR) group. Microarray meta-analysis was adopted to identified differentially expressed genes (DEGs). At the proteomic level, a qualitative and quantitative analysis of proteomics in the left atrial appendage of 18 patients (9 with AF and 9 with SR) was conducted. The machine learning correlation-based feature selection (CSF) method was introduced to selected feature genes of AF using the training set of 130 samples involved in the microarray meta-analysis. The Naive Bayes (NB) based classifier constructed using training set was evaluated on an independent validation test set GSE2240. Results: 863 DEGs with a FDR<0.05 and 482 differentially expressed proteins (DEPs) with a FDR<0.1 and fold change >1.2 were obtained from the transcriptomic and proteomic study, respectively. The DEGs and DEPs were then analyzed together which identified 30 biomarkers with consistent trends. Further, 10 feature, including 8 upregulated genes (CD44, CHGB, FHL2, GGT5, IGFBP2, NRAP, SEPTIN6, YWHAQ) and 2 downregulated genes (TNNT1, TRDN) were selected from the 30 biomarkers through machine learning CFS method using training set. The NB based classifier constructed using the training set accurately and reliably classify AF from SR samples in the validation test set with a precision of 87.5% and AUC of 0.995.Conclusion: Taken together, our present work might provide novel insights into the molecular mechanism and provide some promising diagnostic and therapeutic targets of AF.

Download Full-text

389-P: Ability for Detecting or Predicting Hypoglycemia with the Aid of Machine Learning Techniques: A Meta-analysis

Diabetes ◽

10.2337/db20-389-p ◽

2020 ◽

Vol 69 (Supplement 1) ◽

pp. 389-P

Author(s):

SATORU KODAMA ◽

MAYUKO H. YAMADA ◽

YUTA YAGUCHI ◽

MASARU KITAZAWA ◽

MASANORI KANEKO ◽

...

Keyword(s):

Machine Learning ◽

Meta Analysis ◽

Machine Learning Techniques ◽

Learning Techniques

Download Full-text

Diagnostic test accuracy for use of machine learning in diagnosis of autism spectrum disorder: A Systematic Review and Meta-Analysis (Preprint)

10.2196/preprints.14108 ◽

2019 ◽

Author(s):

Sun Jae Moon ◽

Jin Seub Hwang ◽

Rajesh Kana ◽

John Torous ◽

Jung Won Kim

Keyword(s):

Machine Learning ◽

Systematic Review ◽

Autism Spectrum Disorder ◽

Meta Analysis ◽

Learning Algorithms ◽

Structural Mri ◽

Autism Spectrum ◽

Machine Learning Algorithms ◽

Spectrum Disorder ◽

Test Accuracy

BACKGROUND Over the recent years, machine learning algorithms have been more widely and increasingly applied in biomedical fields. In particular, its application has been drawing more attention in the field of psychiatry, for instance, as diagnostic tests/tools for autism spectrum disorder. However, given its complexity and potential clinical implications, there is ongoing need for further research on its accuracy. OBJECTIVE The current study aims to summarize the evidence for the accuracy of use of machine learning algorithms in diagnosing autism spectrum disorder (ASD) through systematic review and meta-analysis. METHODS MEDLINE, Embase, CINAHL Complete (with OpenDissertations), PsyINFO and IEEE Xplore Digital Library databases were searched on November 28th, 2018. Studies, which used a machine learning algorithm partially or fully in classifying ASD from controls and provided accuracy measures, were included in our analysis. Bivariate random effects model was applied to the pooled data in meta-analysis. Subgroup analysis was used to investigate and resolve the source of heterogeneity between studies. True-positive, false-positive, false negative and true-negative values from individual studies were used to calculate the pooled sensitivity and specificity values, draw SROC curves, and obtain area under the curve (AUC) and partial AUC. RESULTS A total of 43 studies were included for the final analysis, of which meta-analysis was performed on 40 studies (53 samples with 12,128 participants). A structural MRI subgroup meta-analysis (12 samples with 1,776 participants) showed the sensitivity at 0.83 (95% CI-0.76 to 0.89), specificity at 0.84 (95% CI -0.74 to 0.91), and AUC/pAUC at 0.90/0.83. An fMRI/deep neural network (DNN) subgroup meta-analysis (five samples with 1,345 participants) showed the sensitivity at 0.69 (95% CI- 0.62 to 0.75), the specificity at 0.66 (95% CI -0.61 to 0.70), and AUC/pAUC at 0.71/0.67. CONCLUSIONS Machine learning algorithms that used structural MRI features in diagnosis of ASD were shown to have accuracy that is similar to currently used diagnostic tools.

Download Full-text

Text Classification in Clinical Practice Guidelines Using Machine-Learning Assisted Pattern-Based Approach

Applied Sciences ◽

10.3390/app11083296 ◽

2021 ◽

Vol 11 (8) ◽

pp. 3296

Author(s):

Musarrat Hussain ◽

Jamil Hussain ◽

Taqdir Ali ◽

Syed Imran Ali ◽

Hafiz Syed Muhammad Bilal ◽

...

Keyword(s):

Machine Learning ◽

Decision Making ◽

Clinical Practice ◽

Clinical Practice Guidelines ◽

Practice Guidelines ◽

Machine Learning Algorithms ◽

Nominal Group ◽

Specific Information ◽

Matching Techniques ◽

Disease Specific

Clinical Practice Guidelines (CPGs) aim to optimize patient care by assisting physicians during the decision-making process. However, guideline adherence is highly affected by its unstructured format and aggregation of background information with disease-specific information. The objective of our study is to extract disease-specific information from CPG for enhancing its adherence ratio. In this research, we propose a semi-automatic mechanism for extracting disease-specific information from CPGs using pattern-matching techniques. We apply supervised and unsupervised machine-learning algorithms on CPG to extract a list of salient terms contributing to distinguishing recommendation sentences (RS) from non-recommendation sentences (NRS). Simultaneously, a group of experts also analyzes the same CPG and extract the initial patterns “Heuristic Patterns” using a group decision-making method, nominal group technique (NGT). We provide the list of salient terms to the experts and ask them to refine their extracted patterns. The experts refine patterns considering the provided salient terms. The extracted heuristic patterns depend on specific terms and suffer from the specialization problem due to synonymy and polysemy. Therefore, we generalize the heuristic patterns to part-of-speech (POS) patterns and unified medical language system (UMLS) patterns, which make the proposed method generalize for all types of CPGs. We evaluated the initial extracted patterns on asthma, rhinosinusitis, and hypertension guidelines with the accuracy of 76.92%, 84.63%, and 89.16%, respectively. The accuracy increased to 78.89%, 85.32%, and 92.07% with refined machine-learning assistive patterns, respectively. Our system assists physicians by locating disease-specific information in the CPGs, which enhances the physicians’ performance and reduces CPG processing time. Additionally, it is beneficial in CPGs content annotation.

Download Full-text

Data Mining-based Financial Statement Fraud Detection: Systematic Literature Review and Meta-analysis to Estimate Data Sample Mapping of Fraudulent Companies Against Non-fraudulent Companies

Global Business Review ◽

10.1177/0972150920984857 ◽

2021 ◽

pp. 097215092098485

Author(s):

Sonika Gupta ◽

Sushil Kumar Mehta

Keyword(s):

Machine Learning ◽

Data Mining ◽

Literature Review ◽

Systematic Literature Review ◽

Classification Accuracy ◽

Meta Analysis ◽

Financial Statement ◽

Research Articles ◽

Financial Statement Fraud ◽

Data Mining Techniques

Data mining techniques have proven quite effective not only in detecting financial statement frauds but also in discovering other financial crimes, such as credit card frauds, loan and security frauds, corporate frauds, bank and insurance frauds, etc. Classification of data mining techniques, in recent years, has been accepted as one of the most credible methodologies for the detection of symptoms of financial statement frauds through scanning the published financial statements of companies. The retrieved literature that has used data mining classification techniques can be broadly categorized on the basis of the type of technique applied, as statistical techniques and machine learning techniques. The biggest challenge in executing the classification process using data mining techniques lies in collecting the data sample of fraudulent companies and mapping the sample of fraudulent companies against non-fraudulent companies. In this article, a systematic literature review (SLR) of studies from the area of financial statement fraud detection has been conducted. The review has considered research articles published between 1995 and 2020. Further, a meta-analysis has been performed to establish the effect of data sample mapping of fraudulent companies against non-fraudulent companies on the classification methods through comparing the overall classification accuracy reported in the literature. The retrieved literature indicates that a fraudulent sample can either be equally paired with non-fraudulent sample (1:1 data mapping) or be unequally mapped using 1:many ratio to increase the sample size proportionally. Based on the meta-analysis of the research articles, it can be concluded that machine learning approaches, in comparison to statistical approaches, can achieve better classification accuracy, particularly when the availability of sample data is low. High classification accuracy can be obtained with even a 1:1 mapping data set using machine learning classification approaches.

Download Full-text

Inter-Experiment Machine Learning on APT experiments: New Insights from Meta-Analysis

Microscopy and Microanalysis ◽

10.1017/s1431927621001264 ◽

2021 ◽

Vol 27 (S1) ◽

pp. 182-183

Author(s):

Martin Meier ◽

Paul Bagot ◽

Michael Moody ◽

Daniel Haley

Keyword(s):

Machine Learning ◽

Meta Analysis

Download Full-text

Classification of multiwavelength transients with Machine Learning

Monthly Notices of the Royal Astronomical Society ◽

10.1093/mnras/staa3873 ◽

2020 ◽

Author(s):

K Sooknunan ◽

M Lochner ◽

Bruce A Bassett ◽

H V Peiris ◽

R Fender ◽

...

Keyword(s):

Machine Learning ◽

Small Sample ◽

Light Curves ◽

Machine Learning Techniques ◽

Optical Data ◽

Test Time ◽

Test Accuracy ◽

Training Set ◽

The Impact

Abstract With the advent of powerful telescopes such as the Square Kilometer Array and the Vera C. Rubin Observatory, we are entering an era of multiwavelength transient astronomy that will lead to a dramatic increase in data volume. Machine learning techniques are well suited to address this data challenge and rapidly classify newly detected transients. We present a multiwavelength classification algorithm consisting of three steps: (1) interpolation and augmentation of the data using Gaussian processes; (2) feature extraction using wavelets; (3) classification with random forests. Augmentation provides improved performance at test time by balancing the classes and adding diversity into the training set. In the first application of machine learning to the classification of real radio transient data, we apply our technique to the Green Bank Interferometer and other radio light curves. We find we are able to accurately classify most of the eleven classes of radio variables and transients after just eight hours of observations, achieving an overall test accuracy of 78%. We fully investigate the impact of the small sample size of 82 publicly available light curves and use data augmentation techniques to mitigate the effect. We also show that on a significantly larger simulated representative training set that the algorithm achieves an overall accuracy of 97%, illustrating that the method is likely to provide excellent performance on future surveys. Finally, we demonstrate the effectiveness of simultaneous multiwavelength observations by showing how incorporating just one optical data point into the analysis improves the accuracy of the worst performing class by 19%.

Download Full-text

Meta-analysis cum machine learning approaches address the structure and biogeochemical potential of marine copepod associated bacteriobiomes

Scientific Reports ◽

10.1038/s41598-021-82482-z ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Balamurugan Sadaiappan ◽

Chinnamani PrasannaKumar ◽

V. Uthara Nambiar ◽

Mahendran Subramanian ◽

Manguesh U. Gauns

Keyword(s):

Machine Learning ◽

Iron Transport ◽

Meta Analysis ◽

Zooplankton Community ◽

Gradient Boosting ◽

Learning Approaches ◽

16S Rdna Gene ◽

Marine Copepod ◽

Form Of Life ◽

First Time

AbstractCopepods are the dominant members of the zooplankton community and the most abundant form of life. It is imperative to obtain insights into the copepod-associated bacteriobiomes (CAB) in order to identify specific bacterial taxa associated within a copepod, and to understand how they vary between different copepods. Analysing the potential genes within the CAB may reveal their intrinsic role in biogeochemical cycles. For this, machine-learning models and PICRUSt2 analysis were deployed to analyse 16S rDNA gene sequences (approximately 16 million reads) of CAB belonging to five different copepod genera viz., Acartia spp., Calanus spp., Centropages sp., Pleuromamma spp., and Temora spp.. Overall, we predict 50 sub-OTUs (s-OTUs) (gradient boosting classifiers) to be important in five copepod genera. Among these, 15 s-OTUs were predicted to be important in Calanus spp. and 20 s-OTUs as important in Pleuromamma spp.. Four bacterial s-OTUs Acinetobacter johnsonii, Phaeobacter, Vibrio shilonii and Piscirickettsiaceae were identified as important s-OTUs in Calanus spp., and the s-OTUs Marinobacter, Alteromonas, Desulfovibrio, Limnobacter, Sphingomonas, Methyloversatilis, Enhydrobacter and Coriobacteriaceae were predicted as important s-OTUs in Pleuromamma spp., for the first time. Our meta-analysis revealed that the CAB of Pleuromamma spp. had a high proportion of potential genes responsible for methanogenesis and nitrogen fixation, whereas the CAB of Temora spp. had a high proportion of potential genes involved in assimilatory sulphate reduction, and cyanocobalamin synthesis. The CAB of Pleuromamma spp. and Temora spp. have potential genes accountable for iron transport.

Download Full-text