BACPHLIP: Predicting bacteriophage lifestyle from conserved protein domains

2020 ◽  
Author(s):  
Adam J. Hockenberry ◽  
Claus O. Wilke

Abstract Motivation: Bacteriophages are broadly classified into two distinct lifestyles: temperate (lysogenic) and virulent (lytic). Temperate phages are capable of a latent phase of infection within a host cell, whereas virulent phages directly replicate and lyse host cells upon infection. Accurate lifestyle identification is critical for determining the role of individual phage species within ecosystems and their effect on host evolution. Results: Here, we present BACPHLIP, a BACterioPHage LIfestyle Predictor. BACPHLIP detects the presence of a set of conserved protein domains within an input genome and uses this data to predict lifestyle via a Random Forest classifier. The classifier was trained on 634 phage genomes. On an independent test set of 423 phages, BACPHLIP has an accuracy of 98%, greatly exceeding that of the best existing available tool (79%). Availability: BACPHLIP is freely available on GitHub (https://github.com/adamhockenberry/bacphlip) and the code used to build and test the classifier is provided in a separate repository (https://github.com/adamhockenberry/bacphlip-model-dev).
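The feature-encoding idea described in this abstract can be illustrated with a minimal sketch. It assumes that domain detection has already produced a set of domain identifiers for a genome; the domain names and the `encode_presence` helper are hypothetical and are not part of BACPHLIP's actual interface. The resulting 0/1 vector is the kind of input a Random Forest classifier could consume.

```python
from typing import Iterable, List

# Hypothetical reference panel of conserved protein domains (illustrative names only).
DOMAIN_PANEL: List[str] = ["integrase", "excisionase", "repressor", "transposase"]


def encode_presence(detected: Iterable[str], panel: List[str] = DOMAIN_PANEL) -> List[int]:
    """Return a 0/1 vector marking which panel domains were detected in a genome."""
    found = set(detected)
    return [1 if domain in found else 0 for domain in panel]


# Example: a genome in which two of the panel domains were detected.
# Domains outside the panel are simply ignored.
features = encode_presence({"integrase", "repressor", "unrelated_domain"})
print(features)
```

Fixing the panel order up front keeps every genome's vector aligned column by column, which is what a tabular classifier requires.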

2021 ◽  
Vol 39 (15_suppl) ◽  
pp. 2601-2601
Author(s):  
Tao Zhou ◽  
Libin Chen ◽  
Jing Guo ◽  
Mengmeng Zhang ◽  
Huanhuan Liu ◽  
...  

2601 Background: Microsatellite instability (MSI) is a common genomic alteration in several tumors, such as colorectal cancer, endometrial carcinoma, and stomach cancer. Tumors are classified as microsatellite instability-high (MSI-H) or microsatellite stable (MSS) based on the degree of polymorphism in microsatellite lengths. MSI is a predictive biomarker for immunotherapy efficacy in advanced/metastatic solid tumors, especially in colorectal cancer (CRC) patients. Several computational approaches based on targeted panel sequencing data have been used to detect MSI; however, they are considerably affected by sequencing depth and panel size. Methods: We developed MSIFinder, a Python package for automatic MSI classification that applies a random forest classifier (RFC), a machine learning method, to genome sequencing data. We included 19 MSI-H and 25 MSS samples as the training set. First, an RFC model was built from 54 feature markers selected from the training set. Second, the classifier was validated using a test set comprising 21 MSI-H and 379 MSS samples. Results: With this test set, MSIFinder achieved a sensitivity (recall) of 1, a specificity of 0.997, an accuracy of 0.998, a positive predictive value (PPV) of 0.954, an F1 score of 0.977, and an area under the curve (AUC) of 0.999. We found that MSIFinder is less affected by low sequencing depth, achieving a concordance of 0.993 at a sequencing depth of 100×. Furthermore, MSIFinder is less affected by panel size, achieving a concordance of 0.99 when the panel size is 0.5 M (million bases). Conclusions: These results indicate that MSIFinder is a robust MSI classification tool that is less affected by panel size and sequencing depth. Furthermore, MSIFinder can provide reliable MSI detection for scientific and clinical purposes. [Table: see text]


2017 ◽  
Author(s):  
Javad Zahiri ◽  
Babak Khorsand-Ghaffari ◽  
Ramin Shirali Hossein Zade ◽  
Mohammadjavad Kargar ◽  
Ali Akbar Yousefi

Abstract: Angiogenesis inhibition research is at the cutting edge of therapy for angiogenesis-dependent diseases, especially cancer. Recent studies on anti-angiogenic peptides have provided promising results in cancer treatment. In the current study, we propose an effective machine learning-based R package (AntAngioCOOL) to predict anti-angiogenic peptides. We examined more than 200 different classifiers to build an efficient predictor. More than 17000 features were extracted to encode the peptides, of which more than 2000 informative features were ultimately selected to train the classifiers. According to the obtained results, AntAngioCOOL can effectively predict anti-angiogenic peptides: the tool achieved a sensitivity of 88%, a specificity of 77%, and an accuracy of 75% on an independent test set. AntAngioCOOL can be accessed at https://cran.r-project.org/.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Tao Zhou ◽  
Libin Chen ◽  
Jing Guo ◽  
Mengmeng Zhang ◽  
Yanrui Zhang ◽  
...  

Abstract Background Microsatellite instability (MSI) is a common genomic alteration in colorectal cancer, endometrial carcinoma, and other solid tumors. MSI is characterized by a high degree of polymorphism in microsatellite lengths owing to a deficiency in the mismatch repair system. Based on this degree, tumors can be classified as microsatellite instability-high (MSI-H) or microsatellite stable (MSS). MSI is a predictive biomarker for immunotherapy efficacy in advanced/metastatic solid tumors, especially in colorectal cancer patients. Several computational approaches based on targeted panel sequencing data have been used to detect MSI; however, they are considerably affected by sequencing depth and panel size. Results We developed MSIFinder, a Python package for automatic MSI classification that applies a random forest classifier (RFC), a machine learning method, to genome sequencing data. We included 19 MSI-H and 25 MSS samples as the training set. We selected 54 feature markers from the training set, built an RFC model, and validated the classifier using a test set comprising 21 MSI-H and 379 MSS samples. With this test set, MSIFinder achieved a sensitivity (recall) of 1.0, a specificity of 0.997, an accuracy of 0.998, a positive predictive value of 0.954, an F1 score of 0.977, and an area under the curve of 0.999. To further verify the robustness and effectiveness of the model, we used a prospective cohort consisting of 18 MSI-H samples and 122 MSS samples, on which MSIFinder achieved a sensitivity (recall) of 1.0 and a specificity of 1.0. We found that MSIFinder is less affected by low sequencing depth, achieving a concordance of 0.993 at a sequencing depth of 100×. Furthermore, MSIFinder is less affected by panel size, achieving a concordance of 0.99 when the panel size is 0.5 M (million bases).
Conclusion These results indicate that MSIFinder is a robust and effective MSI classification tool that can provide reliable MSI detection for scientific and clinical purposes.
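The test-set metrics reported above can be checked with a short worked calculation. The confusion-matrix counts used here (21 true positives, 0 false negatives, 1 false positive, 378 true negatives) are an assumption inferred from the stated test-set composition (21 MSI-H, 379 MSS) together with the reported sensitivity of 1.0 and specificity of 0.997; the abstract does not list the counts directly.

```python
def binary_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Compute standard binary classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)              # recall on the positive class
    specificity = tn / (tn + fp)              # recall on the negative class
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    ppv = tp / (tp + fp)                      # positive predictive value (precision)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "ppv": ppv,
        "f1": f1,
    }


# Assumed counts (inferred, not stated in the abstract).
metrics = binary_metrics(tp=21, fn=0, fp=1, tn=378)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Under these assumed counts, each computed value agrees with the reported figure to within one unit in the third decimal place, which suggests a single false positive among the 379 MSS samples.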


2018 ◽  
Author(s):  
Chetna Kumari ◽  
Naidu Subbarao ◽  
Muhammad Abulaish

Abstract: Autophagy (Greek for "self-eating") is the cellular process that delivers heterogeneous intracellular material to lysosomes for digestion. Protein kinases are integral to the autophagy process and, when dysregulated or mutated, cause several human diseases. Atg1, the first autophagy-related protein identified, is a serine/threonine protein kinase (STPK). mTOR (mammalian Target of Rapamycin), AMPK (AMP-activated protein kinase), Akt, MAPK (mitogen-activated protein kinase) and PKC (protein kinase C) are other STPKs that regulate various components or steps of autophagy and are often deregulated in cancer. The MAPK family has three subfamilies: ERKs, p38, and JNKs. JNKs (c-Jun N-terminal Kinases) have three isoforms in mammals (JNK1, JNK2, and JNK3), each with distinct cellular locations and functions. JNK1 plays a role in starvation-induced activation of autophagy, and the context-specific role of autophagy in tumorigenesis makes JNK1 a challenging anticancer drug target. Since JNKs are closely related to other members of the MAPK family (p38 MAP kinase and the ERKs), it is difficult to design JNK-selective inhibitors. Designing JNK isoform-selective inhibitors is even more challenging, as the ATP-binding sites of all JNKs are highly conserved. Although limited information is available on computational approaches to predicting JNK1 inhibitors, it is difficult to find literature exploring machine learning techniques for predicting JNK inhibitors. This study applies machine learning, specifically Random Forest (RF), to predict JNK1 inhibitors regulating autophagy in cancer. Here, the RF algorithm is used for two purposes: to select and rank the molecular descriptors calculated using the PaDEL descriptor software, and as a classifier. The descriptors are prioritized by calculating Variable Importance Measures (VIMs) using functions based on the mean square error (IncMSE) and node purity (IncNodePurity) of RF.
The classification model based on a set of 22 prioritized descriptors shows an accuracy of 86.36%, a precision of 88.27%, and an AUC (area under the ROC curve) of 0.8914. We conclude that machine learning-based compound classification using Random Forest is a ligand-based approach that can be adopted for virtual screening of large compound libraries for JNK1 bioactives. Author Summary: Of the three isoforms of JNKs (c-Jun N-terminal Kinases) in humans, each with distinct cellular locations and functions, JNK1 plays a role in starvation-induced activation of autophagy. The role of JNK1 in autophagy modulation and the dual role of autophagy in tumor cells make JNK1 a promising anticancer drug target. Since JNKs are closely related to other members of the MAPK (Mitogen-Activated Protein Kinase) family, it is difficult to design JNK-selective inhibitors. Designing JNK isoform-selective inhibitors is even more challenging, as the ATP-binding sites of all JNKs are highly conserved. Random forest classifiers usually outperform several other machine learning algorithms for classification and prediction tasks in diverse areas of research. In this work, we have used the Random Forest algorithm for two purposes: (i) calculating variable importance measures to rank and select molecular features, and (ii) predicting JNK1 inhibitors regulating autophagy in cancer. We have used PaDEL-calculated molecular features of the JNK1 bioactivity dataset from the ChEMBL database to build classification models using a random forest classifier. Our results show that optimally selecting features from the top 10% ranked by variable importance measure yields high classification accuracy, and that the classification model proposed in this study can be integrated into a drug design pipeline to virtually screen compound libraries for JNK1 inhibitors.
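The ranking step described above, keeping the top 10% of descriptors by variable importance, can be sketched as follows. This is a minimal illustration under stated assumptions: the importance scores and descriptor names are made up, the `select_top_percent` helper is hypothetical, and the study's further step of optimally choosing features from within that pool is not reproduced here.

```python
import math
from typing import Dict, List


def select_top_percent(importances: Dict[str, float], percent: int = 10) -> List[str]:
    """Rank descriptors by importance (highest first) and keep the top `percent`."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range 1..100")
    ranked = sorted(importances, key=importances.get, reverse=True)
    # Round up so that at least one descriptor is always kept.
    keep = max(1, math.ceil(len(ranked) * percent / 100))
    return ranked[:keep]


# Hypothetical importance scores for 20 descriptors named d0..d19.
scores = {f"d{i}": float(i) for i in range(20)}
top = select_top_percent(scores)
print(top)
```

Using an integer percentage rather than a float fraction avoids rounding surprises when computing how many descriptors to keep.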


Author(s):  
Zhijun Qiu ◽  
Qingjie Liu

A front-end method based on random forest proximity distance (PD) is used to screen the test set to improve protein–protein interaction site (PPIS) prediction. The distance metric is assessed under the assumption that a higher-quality distance definition leads to better classification performance. On an independent test set, numerical analysis based on statistical inference shows that PD has an advantage over the Mahalanobis and cosine distances. Because the proximity distance depends on the tree composition of the random forest model, an iterative method is designed to optimize the proximity distance by adjusting the size of the training set, which in turn changes the tree composition of the model. Two PD metrics, 75PD and 50PD, are obtained by the iterative method. On two independent test sets, compared with the PD produced by the original training set, the values of 75PD in Matthews correlation coefficient and F1 score were higher, and the differences between them were statistically significant. All numerical experiments show that the closer the test data are to the training data, the better the prediction results of the predictor. These results indicate that the iterative method can optimize the proximity distance definition and that the distance information provided by PD can be used to indicate the reliability of prediction results.
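A random forest proximity distance can be sketched as below. This is an assumption-laden illustration, not the authors' definition: it uses the common convention that the proximity of two samples is the fraction of trees in which they land in the same leaf, and takes the distance as one minus that proximity. The leaf indices are hypothetical; in practice they would come from a trained forest.

```python
from typing import Sequence


def proximity_distance(leaves_a: Sequence[int], leaves_b: Sequence[int]) -> float:
    """One minus the fraction of trees in which two samples fall in the same leaf."""
    if len(leaves_a) != len(leaves_b) or not leaves_a:
        raise ValueError("leaf index sequences must be non-empty and of equal length")
    shared = sum(a == b for a, b in zip(leaves_a, leaves_b))
    return 1.0 - shared / len(leaves_a)


def distance_to_training_set(test_leaves: Sequence[int],
                             train_leaves: Sequence[Sequence[int]]) -> float:
    """Smallest proximity distance between one test sample and any training sample."""
    return min(proximity_distance(test_leaves, row) for row in train_leaves)


# Hypothetical leaf indices for a forest of 4 trees (one entry per tree).
train = [[3, 7, 1, 2], [5, 7, 0, 2]]
test_sample = [3, 7, 0, 2]
print(distance_to_training_set(test_sample, train))
```

A screening step of the kind the abstract describes could then flag test samples whose distance to the training set exceeds a chosen threshold as less reliable; the threshold itself would need to be tuned.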


2021 ◽  
Author(s):  
Nikhil J. Dhinagar ◽  
Sophia I. Thomopoulos ◽  
Conor Owens-Walton ◽  
Dimitris Stripelis ◽  
Jose-Luis Ambite ◽  
...  

Parkinson's disease (PD) and Alzheimer's disease (AD) are progressive neurodegenerative disorders that affect millions of people worldwide. In this work, we propose a deep learning approach to classify these diseases based on 3D T1-weighted brain MRI. We analyzed several datasets including the Parkinson's Progression Markers Initiative (PPMI), an independent dataset from the University of Pennsylvania School of Medicine (UPenn), the Alzheimer's Disease Neuroimaging Initiative (ADNI), and the Open Access Series of Imaging Studies (OASIS) dataset. PPMI and ADNI were partitioned to train (70%), validate (20%), and test (10%) a 3D convolutional neural network (CNN) for PD and AD classification. The UPenn and OASIS datasets were used as independent test sets to evaluate the model performance during inference. We also implemented a random forest classifier as a baseline model by extracting key radiomics features from the same T1-weighted MRI scans. The proposed 3D CNN model was trained from scratch for the classification tasks. For AD classification, the 3D CNN model achieved an ROC-AUC of 0.878 on the ADNI test set and an average ROC-AUC of 0.789 on the OASIS dataset. For PD classification, the proposed 3D CNN model achieved an ROC-AUC of 0.667 on the PPMI test set and an average ROC-AUC of 0.743 on the UPenn dataset. We also found that model performance was largely maintained when using only 25% of the training dataset. The 3D CNN outperformed the random forest classifier for both the PD and AD tasks. The 3D CNN also generalized better on unseen MRI data from different imaging centers. Our results show that the proposed 3D CNN model was less prone to overfitting for AD than for PD classification. This approach shows promise for screening of PD and AD patients using only T1-weighted brain MRI, which is relatively widely available. 
This model with additional validation could also be used to help differentiate between challenging cases of AD and PD when they present with similarly subtle motor and non-motor symptoms.
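The 70/20/10 partition mentioned above can be sketched as a simple shuffled split. This is only an illustration: the subject identifiers are hypothetical, the `split_70_20_10` helper is not from the study, and the abstract does not say whether the authors stratified by diagnosis or grouped scans by subject, both of which would matter in practice.

```python
import random
from typing import List, Sequence, Tuple


def split_70_20_10(items: Sequence[str], seed: int = 0) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle items reproducibly and split them into train/validation/test (70/20/10)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding in the split sizes.
    n_train = len(shuffled) * 70 // 100
    n_val = len(shuffled) * 20 // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test split
    return train, val, test


# Hypothetical subject identifiers.
subjects = [f"subj_{i:03d}" for i in range(100)]
train_ids, val_ids, test_ids = split_70_20_10(subjects)
print(len(train_ids), len(val_ids), len(test_ids))
```

Splitting at the subject level, rather than the scan level, is the usual way to keep multiple scans of the same person from leaking across partitions.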


1990 ◽  
Vol 29 (03) ◽  
pp. 167-181 ◽  
Author(s):  
G. Hripcsak

Abstract: A connectionist model for decision support was constructed out of several back-propagation modules. Manifestations serve as input to the model; they may be real-valued, and the confidence in their measurement may be specified. The model produces as its output the posterior probability of disease. The model was trained on 1,000 cases taken from a simulated underlying population with three conditionally independent manifestations. The first manifestation had a linear relationship between value and posterior probability of disease, the second had a stepped relationship, and the third was normally distributed. An independent test set of 30,000 cases showed that the model was better able to estimate the posterior probability of disease (the standard deviation of residuals was 0.046, with a 95% confidence interval of 0.046-0.047) than a model constructed using logistic regression (with a standard deviation of residuals of 0.062, with a 95% confidence interval of 0.062-0.063). The model fitted the normal and stepped manifestations better than the linear one. It accommodated intermediate levels of confidence well.
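The evaluation measure used above, the standard deviation of residuals between estimated and true posterior probabilities, can be sketched as follows. The probabilities are made-up values, the `residual_sd` helper is hypothetical, and the use of the sample (n - 1) standard deviation is an assumption, since the abstract does not state which estimator was used.

```python
import statistics
from typing import Sequence


def residual_sd(estimated: Sequence[float], true: Sequence[float]) -> float:
    """Sample standard deviation of the residuals (estimated minus true probability)."""
    if len(estimated) != len(true):
        raise ValueError("estimated and true must have the same length")
    residuals = [e - t for e, t in zip(estimated, true)]
    return statistics.stdev(residuals)


# Hypothetical posterior probabilities for three cases.
estimated_probs = [0.1, 0.5, 0.9]
true_probs = [0.1, 0.4, 0.9]
sd = residual_sd(estimated_probs, true_probs)
print(f"{sd:.4f}")
```

A smaller value means the model's probability estimates track the true posteriors more closely, which is the sense in which the connectionist model is compared with logistic regression.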


2018 ◽  
Vol 10 (5) ◽  
pp. 1-12
Author(s):  
B. Nassih ◽  
A. Amine ◽  
M. Ngadi ◽  
D. Naji ◽  
N. Hmina
