DeephageTP: A Convolutional Neural Network Framework For Identifying Phage-Specific Proteins From Metagenomic Sequencing Data

2021 ◽  
Author(s):  
Yunmeng Chu ◽  
Shun Guo ◽  
Dachao Cui ◽  
Xiongfei Fu ◽  
Yingfei Ma

Abstract Background: Bacteriophage (phage) is the most abundant and diverse biological entity on Earth. This makes it a challenge to identify and annotate phage genomes efficiently on a large scale. Results: Portal (portal protein), TerL (large terminase subunit protein), and TerS (small terminase subunit protein) are the three specific proteins of the tailed phage. Here, we develop a CNN (convolutional neural network)-based framework, DeephageTP, to identify these three protein sequences in metagenome data. The framework takes one-hot encoding data of the original protein sequences as the input and extracts the predictive features in the process of modeling. To overcome the false positive problem, a cutoff-loss-value strategy is introduced based on the distributions of the loss values of the sequences within the same category. The proposed model with the set of cutoff-loss-values demonstrates high performance in terms of Precision in identifying TerL and Portal sequences (94% and 90%, respectively) from the mimic metagenomic dataset. Finally, we tested the efficacy of the framework using three real metagenomic datasets, and the result shows that, compared to the conventional alignment-based methods, our proposed framework has a particular advantage in identifying the novel phage-specific protein sequences of Portal and TerL with remote homology to their counterparts in the training dataset. Conclusions: In summary, our study for the first time develops a CNN-based framework for identifying the phage-specific protein sequences with high complexity and low conservation, and this framework will help us find novel phages in metagenomic sequencing data. The DeephageTP is available at https://github.com/chuym726/DeephageTP.
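
The one-hot input described in the abstract above can be illustrated with a minimal sketch. The 20-letter alphabet, the fixed length `max_len`, the zero-padding, and the handling of unknown residues are all assumptions made for illustration; the abstract does not state how DeephageTP handles these details, and `one_hot_encode` is a hypothetical helper name.

```python
# ASSUMPTION: a plain 20-letter amino acid alphabet; the framework's actual
# alphabet, padding, and truncation rules are not given in the abstract.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence, max_len):
    """Return a max_len x 20 matrix (list of lists) of 0/1 values.

    Sequences longer than max_len are truncated, shorter ones are padded
    with all-zero rows, and residues outside the alphabet (e.g. 'X')
    become all-zero rows.
    """
    matrix = []
    for residue in sequence[:max_len].upper():
        row = [0] * len(AMINO_ACIDS)
        idx = INDEX.get(residue)
        if idx is not None:
            row[idx] = 1
        matrix.append(row)
    # Pad short sequences so every input has the same shape.
    while len(matrix) < max_len:
        matrix.append([0] * len(AMINO_ACIDS))
    return matrix


encoded = one_hot_encode("ACX", max_len=5)
print(len(encoded), len(encoded[0]))
```

A fixed-size matrix of this kind is what a convolutional layer would typically consume; the padding length is a modeling choice rather than something the abstract specifies.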

2020 ◽  
Author(s):  
Yunmeng Chu ◽  
Shun Guo ◽  
Dachao Cui ◽  
Haoran Zhang ◽  
Xiongfei Fu ◽  
...  

Abstract Background: Bacteriophage (phage) is the most abundant and diverse biological entity on Earth. This makes it a challenge to identify and annotate phage genomes efficiently on a large scale. Portal (portal protein), TerL (large terminase subunit protein) and TerS (small terminase subunit protein) are the three specific proteins of the tailed phage. Here, we develop a CNN (convolutional neural network)-based framework, DeephageTP, to identify these three proteins from metagenome data. The framework takes one-hot encoding data of the original protein sequences as the input and extracts the predictive features in the process of modeling. The cutoff loss value for each protein category was determined by exploiting the distributions of the loss values of the sequences within the same category. Finally, we tested the efficacy of the framework using three real metagenomic datasets. Result: The proposed multiclass classification CNN-based model was trained on the training datasets and shows relatively high prediction performance (Accuracy: Portal, 98.8%; TerL, 98.6%; TerS, 97.8%) for the three protein categories. The experiments using the independent mimic dataset demonstrate that the performance of the model could degrade as the data size increases. To address this issue, we determined and set a cutoff loss value for each of the three categories (i.e., TerL: -5.2, Portal: -4.2, TerS: -2.9). With these values, the model obtains high performance in terms of Precision in identifying the TerL and Portal sequences (i.e., ~94% and ~90%, respectively) from the mimic dataset that is 20 times larger than the training dataset. More interestingly, the framework identified from the three real metagenomic datasets many novel phage sequences that are not detectable by the two alignment-based methods (i.e., DIAMOND and HMMER).
Conclusion: Compared to the conventional alignment-based methods, our proposed framework shows high performance in identifying phage-specific protein sequences with a particular advantage in identifying the novel protein sequences with remote homology to their known counterparts in public databases. Indeed, our method could also be applied for identifying other protein sequences characterized by high complexity and low conservation. The DeephageTP is available at https://github.com/chuym726/DeephageTP.
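
The per-category cutoff described above can be sketched as a simple post-filter on the classifier's predictions. The three cutoff values are quoted from the abstract; everything else is an assumption. In particular, the abstract does not define how the loss value is computed or in which direction the cutoff applies, so this sketch assumes a prediction is kept when its loss value is at or below the cutoff for its predicted category. `filter_predictions` and the tuple layout are hypothetical.

```python
# Cutoff loss values quoted in the abstract above.
CUTOFFS = {"TerL": -5.2, "Portal": -4.2, "TerS": -2.9}


def filter_predictions(predictions, cutoffs=CUTOFFS):
    """Keep predictions whose loss value passes the category cutoff.

    predictions: iterable of (sequence_id, predicted_category, loss_value).
    ASSUMPTION: lower loss means a more confident call, so a prediction is
    kept when loss_value <= cutoff. Categories without a cutoff are dropped.
    """
    kept = []
    for seq_id, category, loss_value in predictions:
        cutoff = cutoffs.get(category)
        if cutoff is not None and loss_value <= cutoff:
            kept.append((seq_id, category, loss_value))
    return kept


# Made-up example values, for illustration only.
example = [("seq1", "TerL", -6.0), ("seq2", "TerL", -3.0), ("seq3", "Portal", -4.2)]
print(filter_predictions(example))
```

Filtering of this kind trades recall for precision, which matches the abstract's emphasis on reducing false positives in large datasets.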


2019 ◽  
Vol 17 (01) ◽  
pp. 1950004 ◽  
Author(s):  
Chun Fang ◽  
Yoshitaka Moriwaki ◽  
Aikui Tian ◽  
Caihong Li ◽  
Kentaro Shimizu

Molecular recognition features (MoRFs) are key functional regions of intrinsically disordered proteins (IDPs), which play important roles in the molecular interaction network of cells and are implicated in many serious human diseases. Identifying MoRFs is essential for both functional studies of IDPs and drug design. This study adopts deep learning to develop a model for improving MoRF prediction. We propose a method named en_DCNNMoRF (ensemble deep convolutional neural network-based MoRF predictor). It combines the outcomes of two independent deep convolutional neural network (DCNN) classifiers that take advantage of different features. The first, DCNNMoRF1, employs a position-specific scoring matrix (PSSM) and 22 types of amino acid-related factors to describe protein sequences. The second, DCNNMoRF2, employs a PSSM and 13 types of amino acid indexes to describe protein sequences. For both single classifiers, a DCNN with a novel two-dimensional attention mechanism was adopted, and an averaging strategy was added to further process the output probabilities of each DCNN model. Finally, en_DCNNMoRF combined the two models by averaging their final scores. When compared with other well-known tools applied to the same datasets, the accuracy of the proposed method was comparable with that of state-of-the-art methods. The related web server can be accessed freely via http://vivace.bi.a.u-tokyo.ac.jp:8008/fang/en_MoRFs.php.
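
The final ensemble step described above, averaging the scores of the two classifiers, can be sketched as follows. This covers only the averaging of per-residue scores; the DCNN models themselves and the intermediate processing of their output probabilities are not reproduced, and `ensemble_scores` is a hypothetical name.

```python
def ensemble_scores(scores_a, scores_b):
    """Average two equal-length lists of per-residue scores element-wise.

    scores_a and scores_b stand in for the final outputs of two independent
    classifiers (here: hypothetical values, not real model output).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both classifiers must score the same residues")
    return [(a + b) / 2 for a, b in zip(scores_a, scores_b)]


# Made-up scores for a four-residue stretch.
print(ensemble_scores([0.2, 0.8, 0.5, 0.1], [0.4, 0.6, 0.5, 0.3]))
```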


2021 ◽  
Author(s):  
Xiangning Chen ◽  
Daniel G Chen ◽  
Zhongming Zhao ◽  
Justin M Balko ◽  
Jingchun Chen

Abstract Background: Transcriptome sequencing has been broadly available in clinical studies. However, it remains a challenge to utilize these data effectively due to the high dimensionality of the data and the high correlation of gene expression. Methods: We propose a novel method that transforms RNA sequencing data into artificial image objects (AIOs) and applies a convolutional neural network (CNN) algorithm to classify these AIOs. The AIO technique considers each gene as a pixel in a digital image, and standardizes and rescales gene expression levels into a range suitable for image display. Using the GSE81538 (n = 405) and GSE96058 (n = 3,373) datasets, we create AIOs for the subjects and design CNN models to classify biomarker Ki67 and Nottingham histologic grade (NHG). Results: With 5-fold cross validation, we accomplish a classification accuracy and AUC of 0.797 ± 0.034 and 0.820 ± 0.064 for Ki67 status. For NHG, the weighted average of categorical accuracy is 0.726 ± 0.018, and the weighted average of AUC is 0.848 ± 0.019. With GSE81538 as training data and GSE96058 as testing data, the accuracy and AUC for Ki67 are 0.772 ± 0.014 and 0.820 ± 0.006, and those for NHG are 0.682 ± 0.013 and 0.808 ± 0.003, respectively. These results are comparable to or better than the results reported in the original study. For both Ki67 and NHG, the calls from our models have similar predictive power for survival as the calls from trained pathologists in survival analyses. Comparing the calls from our models and the pathologists, we find that the discordant subjects for Ki67 are a group of patients for whom estrogen receptor, progesterone receptor, PAM50 and NHG could not predict their survival rate, and their responses to chemotherapy and endocrine therapy are also different from the concordant subjects. Conclusions: RNA sequencing data can be transformed into AIOs and be used to classify the status of Ki67 and NHG by a CNN algorithm.
The AIO method can handle high dimension data with highly correlated variables with no requirement for variable selection, leading to a data-driven, consistent and automation-ready approach to model RNA sequencing data.
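
The gene-as-pixel idea described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' procedure: the abstract does not specify the standardization, the target range, or how genes are arranged in the image. The sketch assumes min-max rescaling of one sample to 0..255, row-major placement of genes in a square grid, and zero padding; `expression_to_image` is a hypothetical name.

```python
import math


def expression_to_image(values, low=0, high=255):
    """Map one sample's expression values to a square grid of pixel intensities.

    ASSUMPTIONS: min-max rescaling within the sample, row-major gene order,
    and padding with `low` to fill the last rows of the square.
    """
    if not values:
        raise ValueError("at least one expression value is required")
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    pixels = [
        low if span == 0 else round(low + (v - v_min) / span * (high - low))
        for v in values
    ]
    # Smallest side length whose square holds every gene.
    side = math.isqrt(len(pixels) - 1) + 1
    pixels += [low] * (side * side - len(pixels))
    return [pixels[r * side:(r + 1) * side] for r in range(side)]


# Five made-up expression values become a 3 x 3 grid.
for row in expression_to_image([0.0, 5.0, 10.0, 2.5, 7.5]):
    print(row)
```

The resulting grid could then be passed to any image classifier; how the genes are ordered within the grid is a design decision the abstract leaves open.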


2018 ◽  
Author(s):  
Emma Guerin ◽  
Andrey Shkoporov ◽  
Stephen R. Stockdale ◽  
Adam G. Clooney ◽  
Feargal J. Ryan ◽  
...  

Abstract CrAssphage is yet to be cultured even though it represents the most abundant virus in the gut microbiota of humans. Recently, sequence based classification was performed on distantly related crAss-like phages from multiple environments, leading to the proposal of a familial level taxonomic group [Yutin N, et al. (2018) Discovery of an expansive bacteriophage family that includes the most abundant viruses from the human gut. Nat Microbiol 3(1):38–46]. Here, we assembled the metagenomic sequencing reads from 702 human faecal virome/phageome samples and obtained 98 complete circular crAss-like phage genomes and 145 contigs ≥70kb. In silico comparative genomics and taxonomic analysis were performed, resulting in a classification scheme of crAss-like phages from human faecal microbiomes into 4 candidate subfamilies composed of 10 candidate genera. Moreover, laboratory analysis was performed on faecal samples from an individual harbouring 7 distinct crAss-like phages. We achieved propagation of crAss-like phages in ex vivo human faecal fermentations and visualised Podoviridae virions by electron microscopy. Furthermore, detection of a crAss-like phage capsid protein could be linked to metagenomic sequencing data confirming crAss-like phage structural annotations. Significance CrAssphage is the most abundant biological entity in the human gut, but it remains uncultured in the laboratory and its host(s) is unknown. CrAssphage was not identified in metagenomic studies for many years as its sequence is so different from anything present in databases. To this day, it can only be detected from sequences assembled from metagenomics or viromic datasets (crAss – cross Assembly). In this study, we identified 243 new crAss-like phages from human faecal metagenomic studies. Taxonomic analysis of these crAss-like phages highlighted their extensive diversity within the human microbiome.
We also present the first propagation of crAssphage in faecal fermentations and provide the first electron micrographs of this extraordinary bacteriophage.


2021 ◽  
Vol 17 (10) ◽  
pp. e1009428
Author(s):  
Ryota Sugimoto ◽  
Luca Nishimura ◽  
Phuong Thanh Nguyen ◽  
Jumpei Ito ◽  
Nicholas F. Parrish ◽  
...  

Viruses are the most numerous biological entity, existing in all environments and infecting all cellular organisms. Compared with cellular life, the evolution and origin of viruses are poorly understood; viruses are enormously diverse, and most lack sequence similarity to cellular genes. To uncover viral sequences without relying on either reference viral sequences from databases or marker genes that characterize specific viral taxa, we developed an analysis pipeline for virus inference based on clustered regularly interspaced short palindromic repeats (CRISPR). CRISPR is a prokaryotic nucleic acid restriction system that stores the memory of previous exposure. Our protocol can infer CRISPR-targeted sequences, including viruses, plasmids, and previously uncharacterized elements, and predict their hosts using unassembled short-read metagenomic sequencing data. By analyzing human gut metagenomic data, we extracted 11,391 terminally redundant CRISPR-targeted sequences, which are likely complete circular genomes. The sequences included 2,154 tailed-phage genomes, together with 257 complete crAssphage genomes, 11 genomes larger than 200 kilobases, 766 genomes of Microviridae species, 56 genomes of Inoviridae species, and 95 previously uncharacterized circular small genomes that have no reliably predicted protein-coding gene. We predicted the host(s) of approximately 70% of the discovered genomes at the taxonomic level of phylum by linking protospacers to taxonomically assigned CRISPR direct repeats. These results demonstrate that our protocol is efficient for de novo inference of CRISPR-targeted sequences and their host prediction.
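
Terminal redundancy, which the abstract above uses as a sign of a likely complete circular genome, can be checked with a short sketch: a linearized circular sequence is terminally redundant when its start repeats at its end. The minimum repeat length and the exact-match requirement are assumptions; the abstract does not give the criteria used in the pipeline, and `terminal_repeat_length` is a hypothetical name.

```python
def terminal_repeat_length(sequence, min_len=10):
    """Return the length of the longest prefix that is repeated as the suffix.

    Only exact repeats of at least min_len bases and at most half the
    sequence length are considered (ASSUMPTIONS). Returns 0 when no such
    repeat exists.
    """
    seq = sequence.upper()
    max_len = len(seq) // 2
    for k in range(max_len, min_len - 1, -1):
        if seq[:k] == seq[-k:]:
            return k
    return 0


# Made-up sequence whose first ten bases recur at the end.
toy = "ACGTACGTTT" + "GGGGCCCCAA" + "ACGTACGTTT"
print(terminal_repeat_length(toy, min_len=5))
```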


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Yang Wang ◽  
Zhanchao Li ◽  
Yanfei Zhang ◽  
Yingjun Ma ◽  
Qixing Huang ◽  
...  

Abstract Background The interactions of proteins are determined by their sequences and affect the regulation of the cell cycle, signal transduction and metabolism, which is of extraordinary significance to modern proteomics research. Despite advances in experimental technology, it is still expensive, laborious, and time-consuming to determine protein–protein interactions (PPIs), and there is a strong demand for effective bioinformatics approaches to identify potential PPIs. Considering the large amount of PPI data, a high-performance processor can be utilized to enhance the capability of the deep learning method and directly predict protein sequences. Results We propose the Sequence-Statistics-Content protein sequence encoding format (SSC) based on information extraction from the original sequence for further performance improvement of the convolutional neural network. The original protein sequences are encoded in the three-channel format by introducing statistical information (the second channel) and bigram encoding information (the third channel), which can increase the unique sequence features to enhance the performance of the deep learning model. On predicting protein–protein interaction tasks, the results using the 2D convolutional neural network (2D CNN) with the SSC encoding method are better than those of the 1D CNN with one hot encoding. The independent validation of new interactions from the HIPPIE database (version 2.1 published on July 18, 2017) and the validation of directly predicted results by applying a molecular docking tool indicate the effectiveness of the proposed protein encoding improvement in the CNN model. Conclusion The proposed protein sequence encoding method is efficient at improving the capability of the CNN model on protein sequence-related tasks and may also be effective at enhancing the capability of other machine learning or deep learning methods. 
Prediction accuracy and molecular docking validation showed considerable improvement compared to the existing one-hot encoding method, indicating that the SSC encoding method may be useful for analyzing protein sequence-related tasks. The source code of the proposed methods is freely available for academic research at https://github.com/wangy496/SSC-format/.


2021 ◽  
Author(s):  
Junyu Fan ◽  
Chutao Chen ◽  
Chen Song ◽  
Jiajie Pan ◽  
Guifu Wu

Surveillance of circulating variants of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is of great importance in controlling the coronavirus disease 2019 (COVID-19) pandemic. We propose an alignment-free in silico approach for classifying SARS-CoV-2 variants based on their genomic sequences. A deep learning model was constructed utilizing a stacked 1-D convolutional neural network and multilayer perceptron (MLP). The pre-processed genomic sequencing data of the four SARS-CoV-2 variants were first fed to three stacked convolution-pooling nets to extract local linkage patterns in the sequences. Then a 2-layer MLP was used to compute the correlations between the input and output. Finally, a logistic regression model transformed the output and returned the probability values. Learning curves and stratified 10-fold cross-validation showed that the proposed classifier enables robust variant classification. External validation of the classifier showed an accuracy of 0.9962, precision of 0.9963, recall of 0.9963 and F1 score of 0.9962, outperforming other machine learning methods, including logistic regression, K-nearest neighbor, support vector machine, and random forest. By comparing our model with an MLP model without the convolution-pooling network, we demonstrate the essential role of convolution in extracting viral variant features. Thus, our results indicate that the proposed convolution-based multi-class gene classifier is efficient for the variant classification of SARS-CoV-2.
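
The four metrics reported above (accuracy, precision, recall, F1 score) can be computed for a multi-class classifier as in the sketch below. The abstract does not say how the per-class values were averaged, so macro-averaging is an assumption here, and `macro_metrics` and the variant labels are hypothetical.

```python
def macro_metrics(y_true, y_pred):
    """Return accuracy plus macro-averaged precision, recall, and F1.

    ASSUMPTION: unweighted (macro) averaging over every label that appears
    in either the true or the predicted labels.
    """
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": sum(1 for t, p in pairs if t == p) / len(pairs),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Made-up labels for two hypothetical variant classes.
print(macro_metrics(["alpha", "alpha", "delta", "delta"],
                    ["alpha", "delta", "delta", "delta"]))
```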

