Microbiome meta-analysis and cross-disease comparison enabled by the SIAMCAT machine learning toolbox

2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Jakob Wirbel ◽  
Konrad Zych ◽  
Morgan Essex ◽  
Nicolai Karcher ◽  
Ece Kartal ◽  
...  

Abstract The human microbiome is increasingly mined for diagnostic and therapeutic biomarkers using machine learning (ML). However, metagenomics-specific software is scarce, and overoptimistic evaluation and limited cross-study generalization are prevailing issues. To address these, we developed SIAMCAT, a versatile R toolbox for ML-based comparative metagenomics. We demonstrate its capabilities in a meta-analysis of fecal metagenomic studies (10,803 samples). When naively transferred across studies, ML models lost accuracy and disease specificity, which could however be resolved by a novel training set augmentation strategy. This reveals some biomarkers to be disease-specific, with others shared across multiple conditions. SIAMCAT is freely available from siamcat.embl.de.


2021 ◽  
Vol 70 (11) ◽  
Author(s):  
Wenjia Liu ◽  
Nanjiao Ying ◽  
Qiusi Mo ◽  
Shanshan Li ◽  
Mengjie Shao ◽  
...  

Introduction. Klebsiella pneumoniae, a gram-negative bacterium, is a common pathogen causing nosocomial infection. The drug-resistance rate of K. pneumoniae is increasing year by year, posing a severe threat to public health worldwide. K. pneumoniae has been listed as one of the pathogens causing the global crisis of antimicrobial resistance in nosocomial infections. We need to explore the drug resistance of K. pneumoniae for clinical diagnosis. Single nucleotide polymorphisms (SNPs) are of high density and have rich genetic information in whole-genome sequencing (WGS), which can affect the structure or expression of proteins. SNPs can be used to explore mutation sites associated with bacterial resistance. Hypothesis/Gap Statement. Machine learning methods can detect genetic features associated with the drug resistance of K. pneumoniae from whole-genome SNP data. Aims. This work used Fast Feature Selection (FFS) and Codon Mutation Detection (CMD) machine learning methods to detect genetic features related to drug resistance of K. pneumoniae from whole-genome SNP data. Methods. WGS data on resistance of K. pneumoniae strains to four antibiotics (tetracycline, gentamicin, imipenem, amikacin) were downloaded from the European Nucleotide Archive (ENA). Sequence alignments were performed with MUMmer 3 to complete SNP calling using K. pneumoniae HS11286 chromosome as the reference genome. The FFS algorithm was applied to feature selection of the SNP dataset. The training set was constructed based on mutation sites with mutation frequency >0.995. Based on the original SNP training set, 70% of SNPs were randomly selected from each dataset as the test set to verify the accuracy of the training results. Finally, the resistance genes were obtained by the CMD algorithm and Venny. Results. The numbers of strains resistant to tetracycline, gentamicin, imipenem and amikacin were 931, 1048, 789 and 203, respectively. Machine learning algorithms were applied to the SNP training set and test set, and 28 and 23 resistance genes were predicted, respectively. The 28 resistance genes in the training set included 22 genes in the test set, which verified the accuracy of gene prediction. Among them, some genes (KPHS_35310, KPHS_18220, KPHS_35880, etc.) corresponded to known resistance genes (Eef2, lpxK, MdtC, etc.). Logistic regression classifiers were established based on the identified SNPs in the training set. The areas under the curve (AUCs) for the four antibiotics were 0.939, 0.950, 0.912 and 0.935, showing a strong ability to predict bacterial resistance. Conclusion. Machine learning methods can effectively be used to predict resistance genes and associated SNPs. The FFS and CMD algorithms have wide applicability. They can be used for the drug-resistance analysis of any microorganism with genomic variation and phenotypic data. This work lays a foundation for resistance research in clinical applications.
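The AUC values reported in this abstract summarize how well a classifier's scores rank resistant strains above susceptible ones. As a minimal sketch of that metric only (not the authors' pipeline), the AUC can be computed as the fraction of positive/negative pairs that are ranked correctly, with ties counted as half. The function name `roc_auc` and the toy labels and scores below are hypothetical.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical data: 1 = resistant, 0 = susceptible; scores are made up.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
auc = roc_auc(labels, scores)  # 8 of the 9 positive/negative pairs are ordered correctly
```

The pairwise loop is quadratic in the number of samples, which is fine for illustration; a rank-based formulation would be preferable for large datasets.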


BMC Genomics ◽  
2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Flavio Pazos Obregón ◽  
Martín Palazzo ◽  
Pablo Soto ◽  
Gustavo Guerberoff ◽  
Patricio Yankilevich ◽  
...  

Abstract Background Assembly and function of neuronal synapses require the coordinated expression of a yet undetermined set of genes. Previously, we had trained an ensemble machine learning model to assign a probability of having synaptic function to every protein-coding gene in Drosophila melanogaster. This approach resulted in the publication of a catalogue of 893 genes which we postulated to be very enriched in genes with a still undocumented synaptic function. Since then, the scientific community has experimentally identified 79 new synaptic genes. Here we use these new empirical data to evaluate our original prediction. We also implement a series of changes to the training scheme of our model and, using the new data, we demonstrate that this improves its predictive power. Finally, we added the new synaptic genes to the training set and trained a new model, obtaining a new, enhanced catalogue of putative synaptic genes. Results The retrospective analysis demonstrates that our original catalogue was significantly enriched in new synaptic genes. When the changes to the training scheme were implemented using the original training set, we obtained even higher enrichment. Finally, applying the new training scheme with a training set including the 79 new synaptic genes resulted in an enhanced catalogue of putative synaptic genes. Here we present this new catalogue and announce that a regularly updated version will be available online at: http://synapticgenes.bnd.edu.uy Conclusions We show that training an ensemble of machine learning classifiers solely with the whole-body temporal transcription profiles of known synaptic genes resulted in a catalogue with a significant enrichment in undiscovered synaptic genes. Using new empirical data provided by the scientific community, we validated our original approach, improved our model and obtained an arguably more precise prediction. This approach reduces the number of genes to be tested through hypothesis-driven experimentation and will facilitate our understanding of neuronal function. Availability http://synapticgenes.bnd.edu.uy


2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Yaozhong Liu ◽  
Fan Bai ◽  
Zhenwei Tang ◽  
Na Liu ◽  
Qiming Liu

Abstract Background Atrial fibrillation (AF) is the most common arrhythmia with poorly understood mechanisms. We aimed to investigate the biological mechanism of AF and to discover feature genes by analyzing multi-omics data and by applying a machine learning approach. Methods At the transcriptomic level, four microarray datasets (GSE41177, GSE79768, GSE115574, GSE14975) were downloaded from the Gene Expression Omnibus database, which included 130 available atrial samples from AF and sinus rhythm (SR) patients with valvular heart disease. Microarray meta-analysis was adopted to identify differentially expressed genes (DEGs). At the proteomic level, a qualitative and quantitative analysis of proteomics in the left atrial appendage of 18 patients (9 with AF and 9 with SR) who underwent cardiac valvular surgery was conducted. The machine learning correlation-based feature selection (CFS) method was introduced to select feature genes of AF using the training set of 130 samples involved in the microarray meta-analysis. The Naive Bayes (NB) based classifier constructed using the training set was evaluated on an independent validation test set, GSE2240. Results 863 DEGs with FDR < 0.05 and 482 differentially expressed proteins (DEPs) with FDR < 0.1 and fold change > 1.2 were obtained from the transcriptomic and proteomic study, respectively. The DEGs and DEPs were then analyzed together, which identified 30 biomarkers with consistent trends. Further, 10 features, including 8 upregulated genes (CD44, CHGB, FHL2, GGT5, IGFBP2, NRAP, SEPTIN6, YWHAQ) and 2 downregulated genes (TNNI1, TRDN), were selected from the 30 biomarkers through the machine learning CFS method using the training set. The NB based classifier constructed using the training set accurately and reliably classified AF from SR samples in the validation test set with a precision of 87.5% and AUC of 0.995.
Conclusion Taken together, our present work might provide novel insights into the molecular mechanism and provide some promising diagnostic and therapeutic targets of AF.


2021 ◽  
Author(s):  
Ádám Nagy ◽  
Balázs Ligeti ◽  
János Szebeni ◽  
Sándor Pongor ◽  
Balázs Győrffy

ABSTRACT Introduction Numerous studies demonstrate frequent mutations in the genome of SARS-CoV-2. Our goal was to statistically link mutations to severe disease outcome. Methods We used an automated machine learning approach where 1,594 viral genomes with available clinical follow-up data were used as the training set (797 “severe” and 797 “mild”). The best algorithm, based on random forest classification combined with the LASSO feature selection algorithm, was applied to the training set to link mutation signatures and outcome. The performance of the final model was estimated by repeated, stratified, 10-fold cross validation (CV), then adjusted for multiple testing with Bootstrap Bias Corrected CV. Results We identified 26 protein and UTR mutations significantly linked to severe outcome. The best classification algorithm uses a mutation signature of 22 mutations as well as the patient’s age as the input and shows high classification efficiency with an AUC of 0.94 (CI: [0.912, 0.962]) and a prediction accuracy of 87% (CI: [0.830, 0.903]). Finally, we established an online platform (https://covidoutcome.com/) which is capable of using a viral sequence and the patient’s age as the input and provides a percentage estimation of disease severity. Discussion We demonstrate a statistical association between mutation signatures of SARS-CoV-2 and severe outcome of COVID-19. The established analysis platform enables a real-time analysis of new viral genomes. KEY MESSAGES A statistical link between SARS-CoV-2 mutation status and severe COVID outcome was established using automated machine learning techniques based on random forest and logistic regression combined with feature selection algorithms. A mutation signature based on 3,779 protein coding and 36 UTR mutations capable of identifying severe outcome cases was established. The trained model showed high classification performance (AUC=0.94 (CI: [0.912, 0.962]), accuracy=0.87 (CI: [0.830, 0.903])). A registration-free web-server for automated classification of new samples was set up and is accessible at http://www.covidoutcome.com. The established pipeline provides a quick assessment of future patients warranting a prospective clinical validation.
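The evaluation scheme named in this abstract, repeated stratified 10-fold cross validation, can be sketched in a few lines. The sketch below only generates the train/test index splits; it does not train a model and does not implement the Bootstrap Bias Corrected CV adjustment. The function names, the number of repeats (3) and the seed are assumptions for illustration, since the abstract does not state them; only the class sizes (797 "severe", 797 "mild") come from the text.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k, seed):
    """Yield (train_idx, test_idx) pairs that preserve class proportions per fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idx in by_class.values():
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)  # deal each class round-robin across folds
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test


def repeated_stratified_kfold(labels, k=10, repeats=3, seed=0):
    """Repeat the stratified split with a different shuffle on each repeat."""
    for r in range(repeats):
        yield from stratified_kfold(labels, k, seed + r)


# Class sizes taken from the abstract; repeats=3 is an assumed value.
labels = ["severe"] * 797 + ["mild"] * 797
splits = list(repeated_stratified_kfold(labels, k=10, repeats=3))
```

Each of the resulting splits holds out roughly one tenth of each class, so a performance estimate averaged over them is not distorted by folds that happen to contain mostly one outcome.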


2020 ◽  
Author(s):  
Yaozhong Liu ◽  
Fan Bai ◽  
Zhenwei Tang ◽  
Na Liu ◽  
Qiming Liu

Abstract Background: Atrial fibrillation (AF) is the most common arrhythmia with poorly understood mechanisms. We aimed to investigate the biological mechanism of AF and to discover feature genes by analyzing multi-omics data and by applying a machine learning approach. Methods: At the transcriptomic level, four microarray datasets (GSE41177, GSE79768, GSE115574, GSE14975) were downloaded from the Gene Expression Omnibus database, which included 130 available atrial samples from the AF and sinus rhythm (SR) groups. Microarray meta-analysis was adopted to identify differentially expressed genes (DEGs). At the proteomic level, a qualitative and quantitative analysis of proteomics in the left atrial appendage of 18 patients (9 with AF and 9 with SR) was conducted. The machine learning correlation-based feature selection (CFS) method was introduced to select feature genes of AF using the training set of 130 samples involved in the microarray meta-analysis. The Naive Bayes (NB) based classifier constructed using the training set was evaluated on an independent validation test set, GSE2240. Results: 863 DEGs with an FDR<0.05 and 482 differentially expressed proteins (DEPs) with an FDR<0.1 and fold change >1.2 were obtained from the transcriptomic and proteomic study, respectively. The DEGs and DEPs were then analyzed together, which identified 30 biomarkers with consistent trends. Further, 10 features, including 8 upregulated genes (CD44, CHGB, FHL2, GGT5, IGFBP2, NRAP, SEPTIN6, YWHAQ) and 2 downregulated genes (TNNT1, TRDN), were selected from the 30 biomarkers through the machine learning CFS method using the training set. The NB based classifier constructed using the training set accurately and reliably classified AF from SR samples in the validation test set with a precision of 87.5% and AUC of 0.995. Conclusion: Taken together, our present work might provide novel insights into the molecular mechanism and provide some promising diagnostic and therapeutic targets of AF.


Diabetes ◽  
2020 ◽  
Vol 69 (Supplement 1) ◽  
pp. 389-P
Author(s):  
SATORU KODAMA ◽  
MAYUKO H. YAMADA ◽  
YUTA YAGUCHI ◽  
MASARU KITAZAWA ◽  
MASANORI KANEKO ◽  
...  

2019 ◽  
Author(s):  
Sun Jae Moon ◽  
Jin Seub Hwang ◽  
Rajesh Kana ◽  
John Torous ◽  
Jung Won Kim

BACKGROUND Over the recent years, machine learning algorithms have been more widely and increasingly applied in biomedical fields. In particular, their application has been drawing more attention in the field of psychiatry, for instance, as diagnostic tests/tools for autism spectrum disorder. However, given its complexity and potential clinical implications, there is ongoing need for further research on its accuracy. OBJECTIVE The current study aims to summarize the evidence for the accuracy of use of machine learning algorithms in diagnosing autism spectrum disorder (ASD) through systematic review and meta-analysis. METHODS MEDLINE, Embase, CINAHL Complete (with OpenDissertations), PsycINFO and IEEE Xplore Digital Library databases were searched on November 28th, 2018. Studies that used a machine learning algorithm partially or fully in classifying ASD from controls and provided accuracy measures were included in our analysis. A bivariate random-effects model was applied to the pooled data in meta-analysis. Subgroup analysis was used to investigate and resolve the source of heterogeneity between studies. True-positive, false-positive, false-negative and true-negative values from individual studies were used to calculate the pooled sensitivity and specificity values, draw SROC curves, and obtain area under the curve (AUC) and partial AUC. RESULTS A total of 43 studies were included for the final analysis, of which meta-analysis was performed on 40 studies (53 samples with 12,128 participants). A structural MRI subgroup meta-analysis (12 samples with 1,776 participants) showed the sensitivity at 0.83 (95% CI 0.76 to 0.89), specificity at 0.84 (95% CI 0.74 to 0.91), and AUC/pAUC at 0.90/0.83. An fMRI/deep neural network (DNN) subgroup meta-analysis (five samples with 1,345 participants) showed the sensitivity at 0.69 (95% CI 0.62 to 0.75), the specificity at 0.66 (95% CI 0.61 to 0.70), and AUC/pAUC at 0.71/0.67.
CONCLUSIONS Machine learning algorithms that used structural MRI features in diagnosis of ASD were shown to have accuracy that is similar to currently used diagnostic tools.
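The diagnostic-accuracy quantities used in this review start from each study's 2x2 counts (true positives, false positives, false negatives, true negatives). As a rough sketch, per-study sensitivity and specificity can be derived as below. The pooling step shown here simply sums counts across studies; it is a deliberately naive stand-in and is not the bivariate random-effects model the abstract describes. The `Counts` class, the function names and the two example studies are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """2x2 confusion counts reported by one study (values here are made up)."""
    tp: int
    fp: int
    fn: int
    tn: int


def sensitivity(c: Counts) -> float:
    # Fraction of true ASD cases that the classifier flags as ASD.
    return c.tp / (c.tp + c.fn)


def specificity(c: Counts) -> float:
    # Fraction of controls that the classifier correctly labels as controls.
    return c.tn / (c.tn + c.fp)


def naive_pool(studies):
    """Sum counts across studies; NOT a bivariate random-effects estimate."""
    total = Counts(
        tp=sum(s.tp for s in studies),
        fp=sum(s.fp for s in studies),
        fn=sum(s.fn for s in studies),
        tn=sum(s.tn for s in studies),
    )
    return sensitivity(total), specificity(total)


studies = [Counts(tp=40, fp=10, fn=10, tn=40), Counts(tp=30, fp=5, fn=20, tn=45)]
per_study = [(sensitivity(s), specificity(s)) for s in studies]
pooled_sens, pooled_spec = naive_pool(studies)
```

Summing counts ignores between-study heterogeneity and the correlation between sensitivity and specificity, which is the reason meta-analyses of diagnostic accuracy typically prefer a bivariate model.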


2021 ◽  
Vol 11 (8) ◽  
pp. 3296
Author(s):  
Musarrat Hussain ◽  
Jamil Hussain ◽  
Taqdir Ali ◽  
Syed Imran Ali ◽  
Hafiz Syed Muhammad Bilal ◽  
...  

Clinical Practice Guidelines (CPGs) aim to optimize patient care by assisting physicians during the decision-making process. However, guideline adherence is highly affected by its unstructured format and aggregation of background information with disease-specific information. The objective of our study is to extract disease-specific information from CPGs to enhance the adherence ratio. In this research, we propose a semi-automatic mechanism for extracting disease-specific information from CPGs using pattern-matching techniques. We apply supervised and unsupervised machine-learning algorithms to a CPG to extract a list of salient terms that help distinguish recommendation sentences (RS) from non-recommendation sentences (NRS). Simultaneously, a group of experts also analyzes the same CPG and extracts the initial patterns (“Heuristic Patterns”) using a group decision-making method, the nominal group technique (NGT). We provide the list of salient terms to the experts and ask them to refine their extracted patterns. The experts refine patterns considering the provided salient terms. The extracted heuristic patterns depend on specific terms and suffer from the specialization problem due to synonymy and polysemy. Therefore, we generalize the heuristic patterns to part-of-speech (POS) patterns and unified medical language system (UMLS) patterns, which makes the proposed method generalizable to all types of CPGs. We evaluated the initial extracted patterns on asthma, rhinosinusitis, and hypertension guidelines with accuracies of 76.92%, 84.63%, and 89.16%, respectively. The accuracy increased to 78.89%, 85.32%, and 92.07%, respectively, with refined machine-learning assistive patterns. Our system assists physicians by locating disease-specific information in the CPGs, which enhances the physicians’ performance and reduces CPG processing time. Additionally, it is beneficial in CPG content annotation.


2021 ◽  
pp. 097215092098485
Author(s):  
Sonika Gupta ◽  
Sushil Kumar Mehta

Data mining techniques have proven quite effective not only in detecting financial statement frauds but also in discovering other financial crimes, such as credit card frauds, loan and security frauds, corporate frauds, bank and insurance frauds, etc. Classification of data mining techniques, in recent years, has been accepted as one of the most credible methodologies for the detection of symptoms of financial statement frauds through scanning the published financial statements of companies. The retrieved literature that has used data mining classification techniques can be broadly categorized on the basis of the type of technique applied, as statistical techniques and machine learning techniques. The biggest challenge in executing the classification process using data mining techniques lies in collecting the data sample of fraudulent companies and mapping the sample of fraudulent companies against non-fraudulent companies. In this article, a systematic literature review (SLR) of studies from the area of financial statement fraud detection has been conducted. The review has considered research articles published between 1995 and 2020. Further, a meta-analysis has been performed to establish the effect of data sample mapping of fraudulent companies against non-fraudulent companies on the classification methods through comparing the overall classification accuracy reported in the literature. The retrieved literature indicates that a fraudulent sample can either be equally paired with non-fraudulent sample (1:1 data mapping) or be unequally mapped using 1:many ratio to increase the sample size proportionally. Based on the meta-analysis of the research articles, it can be concluded that machine learning approaches, in comparison to statistical approaches, can achieve better classification accuracy, particularly when the availability of sample data is low. 
High classification accuracy can be obtained with even a 1:1 mapping data set using machine learning classification approaches.
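The 1:1 and 1:many data mappings discussed in this abstract describe how many non-fraudulent companies are paired with each fraudulent one when building a classification sample. A minimal sketch of that construction follows. It assumes purely random pairing without reuse of controls, whereas published studies may match on firm characteristics; the function name `map_controls`, the company identifiers and the seed are hypothetical.

```python
import random


def map_controls(fraud_ids, nonfraud_ids, ratio=1, seed=0):
    """Pair each fraudulent firm with `ratio` distinct non-fraudulent firms.

    Controls are drawn at random without reuse (an assumption of this sketch).
    """
    need = len(fraud_ids) * ratio
    if need > len(nonfraud_ids):
        raise ValueError("not enough non-fraudulent firms for the requested ratio")
    rng = random.Random(seed)
    pool = rng.sample(nonfraud_ids, need)
    return {f: pool[i * ratio:(i + 1) * ratio] for i, f in enumerate(fraud_ids)}


# Hypothetical identifiers for illustration only.
fraud = ["F1", "F2", "F3"]
nonfraud = [f"N{i}" for i in range(1, 21)]

one_to_one = map_controls(fraud, nonfraud, ratio=1)    # balanced sample, 3 + 3 firms
one_to_three = map_controls(fraud, nonfraud, ratio=3)  # larger sample, 3 + 9 firms
```

Raising the ratio enlarges the sample but also makes the classes imbalanced, which is the trade-off the meta-analysis examines when comparing reported classification accuracy.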

