MSIFinder: A python package for detecting MSI status using random forest classifier.

2601 Background: Microsatellite instability (MSI) is a common genomic alteration in several tumors, such as colorectal cancer, endometrial carcinoma, and stomach cancer. It is characterized by a high degree of polymorphism in microsatellite lengths, and tumors are classified as microsatellite instability-high (MSI-H) or microsatellite stable (MSS) on that basis. MSI is a predictive biomarker for immunotherapy efficacy in advanced/metastatic solid tumors, especially in colorectal cancer (CRC) patients. Several computational approaches based on target panel sequencing data have been used to detect MSI; however, they are considerably affected by the sequencing depth and panel size. Methods: We developed MSIFinder, a Python package for automatic MSI classification that applies a random forest classifier (RFC), a machine learning technique, to sequencing data. We included 19 MSI-H and 25 MSS samples as training sets. First, an RFC model was built using 54 feature markers from the training sets. Second, the classifier was validated using a test set comprising 21 MSI-H and 379 MSS samples. Results: With this test set, MSIFinder achieved a sensitivity (recall) of 1.0, a specificity of 0.997, an accuracy of 0.998, a positive predictive value (PPV) of 0.954, an F1 score of 0.977, and an area under the curve (AUC) of 0.999. We found that MSIFinder is less affected by low sequencing depth and can achieve a concordance of 0.993 at a sequencing depth of 100×. Furthermore, MSIFinder is less affected by the panel size and can achieve a concordance of 0.99 when the panel size is 0.5 M (million bases). Conclusions: These results indicate that MSIFinder is a robust MSI classification tool that is little affected by panel size and sequencing depth. Furthermore, MSIFinder can provide reliable MSI detection for scientific and clinical purposes. [Table: see text]

MSIFinder: a python package for detecting MSI status using random forest classifier

BMC Bioinformatics ◽

10.1186/s12859-021-03986-z ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Tao Zhou ◽

Libin Chen ◽

Jing Guo ◽

Mengmeng Zhang ◽

Yanrui Zhang ◽

...

Keyword(s):

Colorectal Cancer ◽

Random Forest ◽

Microsatellite Instability ◽

Solid Tumors ◽

Random Forest Classifier ◽

Sequencing Depth ◽

Test Set ◽

Panel Size ◽

Python Package ◽

Training Sets

Abstract. Background: Microsatellite instability (MSI) is a common genomic alteration in colorectal cancer, endometrial carcinoma, and other solid tumors. MSI is characterized by a high degree of polymorphism in microsatellite lengths owing to deficiency in the mismatch repair system. Based on the degree of instability, tumors can be classified as microsatellite instability-high (MSI-H) or microsatellite stable (MSS). MSI is a predictive biomarker for immunotherapy efficacy in advanced/metastatic solid tumors, especially in colorectal cancer patients. Several computational approaches based on target panel sequencing data have been used to detect MSI; however, they are considerably affected by the sequencing depth and panel size. Results: We developed MSIFinder, a Python package for automatic MSI classification that applies a random forest classifier (RFC), a machine learning technique, to sequencing data. We included 19 MSI-H and 25 MSS samples as training sets. We selected 54 feature markers from the training sets, built an RFC model, and validated the classifier using a test set comprising 21 MSI-H and 379 MSS samples. With this test set, MSIFinder achieved a sensitivity (recall) of 1.0, a specificity of 0.997, an accuracy of 0.998, a positive predictive value of 0.954, an F1 score of 0.977, and an area under the curve of 0.999. To further verify the robustness and effectiveness of the model, we used a prospective cohort consisting of 18 MSI-H samples and 122 MSS samples. MSIFinder achieved a sensitivity (recall) of 1.0 and a specificity of 1.0. We found that MSIFinder is less affected by a low sequencing depth and can achieve a concordance of 0.993 at a sequencing depth of 100×. Furthermore, MSIFinder is less affected by the panel size and can achieve a concordance of 0.99 when the panel size is 0.5 M (million bases).
Conclusion: These results indicate that MSIFinder is a robust and effective MSI classification tool that can provide reliable MSI detection for scientific and clinical purposes.
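
The test-set figures reported above can be cross-checked with a short calculation. The sketch below assumes a confusion matrix of 21 true positives, 0 false negatives, 1 false positive, and 378 true negatives. These counts are not stated in the abstract; they are inferred here as the only integer counts consistent with 21 MSI-H and 379 MSS samples, a recall of 1.0, and a positive predictive value of 0.954. The function name is illustrative and is not part of MSIFinder.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute standard binary-classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)              # recall: share of MSI-H samples detected
    specificity = tn / (tn + fp)              # share of MSS samples correctly called
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    ppv = tp / (tp + fp)                      # positive predictive value (precision)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "ppv": ppv,
        "f1": f1,
    }


# Assumed counts (inferred, not reported): 21 MSI-H all detected, 1 of 379 MSS miscalled.
metrics = classification_metrics(tp=21, fn=0, fp=1, tn=378)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Under these assumed counts, every metric agrees with the reported value to within rounding, which supports reading the recall as 1.0 and the specificity as 0.997.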

Using Decision Tree Aggregation with Random Forest Model to Identify Gut Microbes Associated with Colorectal Cancer

Genes ◽

10.3390/genes10020112 ◽

2019 ◽

Vol 10 (2) ◽

pp. 112 ◽

Cited By ~ 11

Author(s):

Dongmei Ai ◽

Hongfei Pan ◽

Rongbao Han ◽

Xiaoxin Li ◽

Gang Liu ◽

...

Keyword(s):

Colorectal Cancer ◽

Random Forest ◽

Decision Tree ◽

Scientific Data ◽

Random Forest Classifier ◽

Colorectal Cancers ◽

Human Gut ◽

Gut Microbes ◽

Colorectal Cancer Patients ◽

Consensus Decision

The imbalance of human gut microbiota has been associated with colorectal cancer. In recent years, metagenomics research has provided a large amount of scientific data enabling us to study the dedicated roles of gut microbes in the onset and progression of cancer. We removed unrelated and redundant features during feature selection by mutual information. We then trained a random forest classifier on a large metagenomics dataset of colorectal cancer patients and healthy people assembled from published reports and extracted and analysed the information from the learned decision trees. We identified key microbial species associated with colorectal cancers. These microbes included Porphyromonas asaccharolytica, Peptostreptococcus stomatis, Fusobacterium, Parvimonas sp., Streptococcus vestibularis and Flavonifractor plautii. We obtained the optimal splitting abundance thresholds for these species to distinguish between healthy and colorectal cancer samples. This extracted consensus decision tree may be applied to the diagnosis of colorectal cancers.
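
To illustrate what an "optimal splitting abundance threshold" means for a single species, the sketch below scans candidate thresholds and keeps the one with the lowest weighted Gini impurity, which is the criterion a decision tree commonly uses for one split. This is a simplified stand-in, not the authors' procedure for extracting thresholds from a trained random forest, and the abundance values and labels are made-up toy data.

```python
def gini(labels):
    """Gini impurity of a list of binary labels (0 = healthy, 1 = cancer)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)


def best_threshold(abundances, labels):
    """Return (threshold, weighted_gini) for the best single split on one feature."""
    pairs = sorted(zip(abundances, labels))
    n = len(pairs)
    best_impurity, best_cut = float("inf"), None
    for i in range(1, n):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # no threshold can separate identical abundance values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        if weighted < best_impurity:
            best_impurity, best_cut = weighted, (low + high) / 2
    return best_cut, best_impurity


# Toy relative abundances of one hypothetical species (illustrative values only).
toy_abundance = [0.00, 0.01, 0.02, 0.03, 0.20, 0.25, 0.30, 0.40]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
threshold, impurity = best_threshold(toy_abundance, toy_labels)
print(threshold, impurity)
```

In a random forest, each tree would choose such thresholds on bootstrap samples and feature subsets, so a consensus threshold has to be aggregated across trees rather than read from one split.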

BACPHLIP: Predicting bacteriophage lifestyle from conserved protein domains

10.1101/2020.05.13.094805 ◽

2020 ◽

Author(s):

Adam J. Hockenberry ◽

Claus O. Wilke

Keyword(s):

Random Forest ◽

Protein Domains ◽

Random Forest Classifier ◽

Host Cells ◽

Test Set ◽

Link Type ◽

Independent Test ◽

Host Evolution ◽

Latent Phase

Abstract. Motivation: Bacteriophages are broadly classified into two distinct lifestyles: temperate (lysogenic) and virulent (lytic). Temperate phages are capable of a latent phase of infection within a host cell, whereas virulent phages directly replicate and lyse host cells upon infection. Accurate lifestyle identification is critical for determining the role of individual phage species within ecosystems and their effect on host evolution. Results: Here, we present BACPHLIP, a BACterioPHage LIfestyle Predictor. BACPHLIP detects the presence of a set of conserved protein domains within an input genome and uses this data to predict lifestyle via a Random Forest classifier. The classifier was trained on 634 phage genomes. On an independent test set of 423 phages, BACPHLIP has an accuracy of 98%, greatly exceeding that of the best existing available tool (79%). Availability: BACPHLIP is freely available on GitHub (https://github.com/adamhockenberry/bacphlip) and the code used to build and test the classifier is provided in a separate repository (https://github.com/adamhockenberry/bacphlip-model-dev).
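
The feature representation described above, presence or absence of conserved protein domains, can be sketched as a fixed-length binary vector per genome. The example below is only an illustration of that encoding: the domain names are hypothetical placeholders, the function is not part of the BACPHLIP API, and the domain-detection step itself is assumed to have happened elsewhere.

```python
def presence_vector(detected_domains, reference_domains):
    """Encode which reference domains were detected in a genome as a 0/1 vector."""
    detected = set(detected_domains)
    return [1 if domain in detected else 0 for domain in reference_domains]


# Hypothetical reference panel and detection results (placeholder names only).
reference_domains = ["domain_A", "domain_B", "domain_C", "domain_D"]
genome_hits = ["domain_C", "domain_A", "domain_A"]

features = presence_vector(genome_hits, reference_domains)
print(features)
```

A classifier such as a random forest would then be trained on one such vector per genome, with the known lifestyle as the label.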

3D Convolutional Neural Networks for Classification of Alzheimer's and Parkinson's Disease with T1-Weighted Brain MRI

10.1101/2021.07.26.453903 ◽

2021 ◽

Author(s):

Nikhil J. Dhinagar ◽

Sophia I. Thomopoulos ◽

Conor Owens-Walton ◽

Dimitris Stripelis ◽

Jose-Luis Ambite ◽

...

Keyword(s):

Alzheimer's Disease ◽

Parkinson's Disease ◽

Random Forest ◽

Brain MRI ◽

Model Performance ◽

Random Forest Classifier ◽

Test Set ◽

3D CNN

Parkinson's disease (PD) and Alzheimer's disease (AD) are progressive neurodegenerative disorders that affect millions of people worldwide. In this work, we propose a deep learning approach to classify these diseases based on 3D T1-weighted brain MRI. We analyzed several datasets including the Parkinson's Progression Markers Initiative (PPMI), an independent dataset from the University of Pennsylvania School of Medicine (UPenn), the Alzheimer's Disease Neuroimaging Initiative (ADNI), and the Open Access Series of Imaging Studies (OASIS) dataset. PPMI and ADNI were partitioned to train (70%), validate (20%), and test (10%) a 3D convolutional neural network (CNN) for PD and AD classification. The UPenn and OASIS datasets were used as independent test sets to evaluate the model performance during inference. We also implemented a random forest classifier as a baseline model by extracting key radiomics features from the same T1-weighted MRI scans. The proposed 3D CNN model was trained from scratch for the classification tasks. For AD classification, the 3D CNN model achieved an ROC-AUC of 0.878 on the ADNI test set and an average ROC-AUC of 0.789 on the OASIS dataset. For PD classification, the proposed 3D CNN model achieved an ROC-AUC of 0.667 on the PPMI test set and an average ROC-AUC of 0.743 on the UPenn dataset. We also found that model performance was largely maintained when using only 25% of the training dataset. The 3D CNN outperformed the random forest classifier for both the PD and AD tasks. The 3D CNN also generalized better on unseen MRI data from different imaging centers. Our results show that the proposed 3D CNN model was less prone to overfitting for AD than for PD classification. This approach shows promise for screening of PD and AD patients using only T1-weighted brain MRI, which is relatively widely available. 
This model with additional validation could also be used to help differentiate between challenging cases of AD and PD when they present with similarly subtle motor and non-motor symptoms.
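
The 70% / 20% / 10% partition into training, validation, and test sets can be sketched as a seeded shuffle of item indices. The abstract does not say how the partition was drawn, so the random shuffle, the fixed seed, and the function name below are all assumptions made for illustration. In practice the split would usually be made per subject so that scans from one person never land in more than one partition.

```python
import random


def split_indices(n_items, fractions=(0.7, 0.2, 0.1), seed=0):
    """Shuffle indices reproducibly and cut them into train/validation/test parts."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_train = round(fractions[0] * n_items)
    n_val = round(fractions[1] * n_items)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder, so no item is dropped
    return train, val, test


train_idx, val_idx, test_idx = split_indices(100)
print(len(train_idx), len(val_idx), len(test_idx))
```

Assigning the remainder to the test partition guarantees that the three parts are disjoint and together cover every item, even when the fractions do not divide the dataset size evenly.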

TUMOUR PATHOLOGY PREDICTS MICROSATELLITE INSTABILITY IN A POPULATION-BASED SERIES OF COLORECTAL CANCER CASES

Clinical & Investigative Medicine ◽

10.25011/cim.v31i4.4807 ◽

2008 ◽

Vol 31 (4) ◽

pp. 12

Author(s):

A J Hyde ◽

D Fontaine ◽

R C Green ◽

M Simms ◽

P S Parfrey ◽

...

Keyword(s):

Colorectal Cancer ◽

Lynch Syndrome ◽

Microsatellite Instability ◽

Population Based ◽

Pathological Features ◽

Predictive Tool ◽

Entire Cohort ◽

DNA Mismatch ◽

Bethesda Guidelines ◽

Revised Bethesda Guidelines

Background: Lynch Syndrome is an autosomal dominant trait that accounts for approximately 3% of all cases of colorectal cancer (CRC). It is caused by mutations in DNA mismatch repair (MMR) genes, most commonly MLH1 or MSH2. These MMR defects cause high levels of microsatellite instability (MSI-H) in the tumours. MSI testing of all CRCs to identify potential Lynch Syndrome cases is not practical, so the Bethesda Guidelines, which use clinical and pathological features, were created to identify those tumours most likely to be MSI-H [1]. In 2007 Jenkins et al. created MsPath, a tool based on the pathological features described in the rarely used 3rd Bethesda criterion, to improve prediction of MSI-H tumours among CRC cases diagnosed before age 60 years [2]. Methods: We collected a population-based cohort of 716 CRC cases diagnosed before age 75 years in Newfoundland. For each of these cases we collected family history, performed MSI analysis, and scored a number of pathological features for the purpose of evaluating the accuracy of the Bethesda Criteria and MsPath at predicting MSI-H tumours. Results: Our work validates the MsPath tool in the Newfoundland population for the same age group used to create the tool. We found it identified MSI-H cases with a sensitivity of 95% and specificity of 35% in our population of CRC cases diagnosed before age 60 years (n=290). We also tested this tool on our older population of CRC cases, diagnosed at ages 60 to 74 years (n=426). We found it to be at least as predictive in this population, with a sensitivity of 95% and a specificity of 42%. We then used our entire cohort (N=716) to compare MsPath with the other Bethesda criteria. Bethesda criteria 1, 2, 4 and 5 together predicted MSI-H cases with a sensitivity of 67% and a specificity of 51%. MsPath was better at identifying these cases, with a sensitivity of 95% and a specificity of 39%.
Conclusions: We conclude that MsPath can be extended to include patients diagnosed with CRC before age 75 years. As well, we have found that MsPath is a better predictive tool than the Revised Bethesda Guidelines for identifying MSI-H cases within a population-based setting of colorectal cancer. References: 1. Umar, A. et al. J Natl Cancer Inst 2004;96:261-8. 2. Jenkins, M.A. et al. Gastroenterology 2007;133:48-56.
