Identification of 2’-O-methylation Site by Investigating Multi-feature Extracting Techniques

Background: RNA methylation is a reversible post-transcriptional modification involved in numerous biological processes. Ribose 2'-O-methylation is one type of RNA methylation, and it has been shown to play an important role in immune recognition and in pathogenesis. Objective: We aim to design a computational method to identify 2'-O-methylation sites. Methods: In contrast to experimental methods, we propose a computational workflow that identifies methylation sites using a multi-feature extraction algorithm. Results: With a voting procedure based on the 7 best feature-classifier combinations, we achieved an accuracy of 76.5% in 10-fold cross-validation. Furthermore, we optimized the features and fed them into an SVM, and the AUC reached 0.813. Conclusion: The RNA samples used in this study, especially the negative samples, are more objective and strict, so we obtained more representative results than state-of-the-art studies.
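
The voting step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the voting rule, so simple majority voting over binary labels is an assumption here, and the seven prediction lists are made-up placeholders rather than outputs of the authors' models.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label chosen by most classifiers for a single sample."""
    counts = Counter(predictions)
    label, _ = counts.most_common(1)[0]
    return label


def vote_all(per_model_predictions):
    """Combine predictions where per_model_predictions[m][i] is model m's label for sample i."""
    return [majority_vote(column) for column in zip(*per_model_predictions)]


# Seven hypothetical feature-classifier combinations, three samples each.
models = [
    [1, 0, 1],
    [1, 0, 0],
    [0, 0, 1],
    [1, 1, 1],
    [1, 0, 0],
    [0, 0, 1],
    [1, 1, 0],
]
consensus = vote_all(models)
print(consensus)
```

With an odd number of voters and two classes, a tie cannot occur, which is one practical reason to combine seven models rather than six or eight.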


WITMSG: Large-scale Prediction of Human Intronic m6A RNA Methylation Sites from Sequence and Genomic Features

Current Genomics ◽

10.2174/1389202921666200211104140 ◽

2020 ◽

Vol 21 (1) ◽

pp. 67-76 ◽

Cited By ~ 4

Author(s):

Lian Liu ◽

Xiujuan Lei ◽

Jia Meng ◽

Zhen Wei

Keyword(s):

Large Scale ◽

Cross Validation ◽

RNA Localization ◽

Training Data ◽

Biological Processes ◽

Computational Framework ◽

RNA Methylation ◽

m6A RNA Methylation ◽

First Time ◽

Fold Cross Validation

Introduction: N6-methyladenosine (m6A) is one of the most widely studied epigenetic modifications. It plays important roles in various biological processes, such as splicing, RNA localization and degradation, many of which are related to the functions of introns. Although a number of computational approaches have been proposed to predict m6A sites in different species, none of them were optimized for intronic m6A sites. Because existing experimental data overwhelmingly relied on polyA selection in sample preparation, and intronic RNAs are usually underrepresented in the captured RNA library, the accuracy of general m6A site prediction approaches is limited for the intronic m6A site prediction task. Methodology: A computational framework, WITMSG, dedicated to the large-scale prediction of intronic m6A RNA methylation sites in humans is proposed here for the first time. Based on the random forest algorithm and using only known intronic m6A sites as the training data, WITMSG takes advantage of both conventional sequence features and a variety of genomic characteristics for improved prediction performance of intron-specific m6A sites. Results and Conclusion: WITMSG outperformed competing approaches (trained with all the m6A sites or intronic m6A sites only) in 10-fold cross-validation (AUC: 0.940) and when tested on independent datasets (AUC: 0.946). WITMSG was also applied intronome-wide in humans to predict all possible intronic m6A sites, and the prediction results are freely accessible at http://rnamd.com/intron/.


IILLS: predicting virus-receptor interactions based on similarity and semi-supervised learning

BMC Bioinformatics ◽

10.1186/s12859-019-3278-3 ◽

2019 ◽

Vol 20 (S23) ◽

Cited By ~ 3

Author(s):

Cheng Yan ◽

Guihua Duan ◽

Fang-Xiang Wu ◽

Jianxin Wang

Keyword(s):

Infectious Diseases ◽

Cross Validation ◽

Sequence Similarity ◽

Least Square ◽

Computational Method ◽

Receptor Interaction ◽

Virus Receptor ◽

Receptor Interactions ◽

Leave One Out ◽

Fold Cross Validation

Abstract Background Viral infectious diseases are a serious threat to human health. Receptor binding is the first step in the viral infection of hosts. To treat human viral infectious diseases more effectively, hidden virus-receptor interactions must be discovered. However, current computational methods for predicting virus-receptor interactions are limited. Result In this study, we propose a new computational method (IILLS) to predict virus-receptor interactions based on the Initial Interaction scores method via the neighbors and the Laplacian regularized Least Square algorithm. IILLS integrates the known virus-receptor interactions and amino acid sequences of receptors. The similarity of viruses is calculated by the Gaussian Interaction Profile (GIP) kernel. We also compute the receptor GIP similarity and the receptor sequence similarity; the sequence similarity is then used as the final similarity of receptors according to the prediction results. 10-fold cross validation (10CV) and leave-one-out cross validation (LOOCV) are used to assess the prediction performance of our method. We also compare our method with three other competing methods (BRWH, LapRLS, CMF). Conclusion The experimental results show that IILLS achieves AUC values of 0.8675 and 0.9061 with 10-fold cross validation and leave-one-out cross validation (LOOCV), respectively, which shows that IILLS is superior to the competing methods. In addition, the case studies further indicate that the IILLS method is effective for virus-receptor interaction prediction.
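
For readers unfamiliar with the GIP kernel named above, the following is a minimal sketch of how such a similarity is commonly computed from a binary interaction matrix. The abstract gives no formula, so the bandwidth normalisation by the mean squared profile norm, the default gamma_prime of 1.0, and the toy matrix are all assumptions and not the IILLS settings.

```python
import math


def gip_kernel(interactions, gamma_prime=1.0):
    """Gaussian Interaction Profile similarity between the rows of a 0/1 matrix.

    Each row is the interaction profile of one virus (columns are receptors).
    """
    n = len(interactions)
    sq_norms = [sum(v * v for v in row) for row in interactions]
    mean_sq_norm = sum(sq_norms) / n
    if mean_sq_norm == 0:
        raise ValueError("at least one known interaction is required")
    # Bandwidth scaled by the average squared norm of the profiles (assumed convention).
    gamma = gamma_prime / mean_sq_norm
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dist_sq = sum((a - b) ** 2 for a, b in zip(interactions[i], interactions[j]))
            kernel[i][j] = math.exp(-gamma * dist_sq)
    return kernel


# Toy profiles: three hypothetical viruses against three hypothetical receptors.
profiles = [
    [1, 0, 1],
    [1, 0, 1],
    [0, 1, 0],
]
similarity = gip_kernel(profiles)
```

Viruses with identical interaction profiles receive similarity 1, and the value decays towards 0 as the profiles diverge. Applying the same function to the transposed matrix would give a receptor-side GIP similarity.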


Deep6mA: A deep learning framework for exploring similar patterns in DNA N6-methyladenine sites across different species

PLoS Computational Biology ◽

10.1371/journal.pcbi.1008767 ◽

2021 ◽

Vol 17 (2) ◽

pp. e1008767

Author(s):

Zutan Li ◽

Hangjin Jiang ◽

Lingpeng Kong ◽

Yuanyuan Chen ◽

Kun Lang ◽

...

Keyword(s):

Deep Learning ◽

Prediction Accuracy ◽

Cross Validation ◽

Biological Processes ◽

Biological Functions ◽

Learning Framework ◽

Wide Range ◽

Genomic Scale ◽

Downstream Gene Expression ◽

Fold Cross Validation

N6-methyladenine (6mA) is an important DNA modification associated with a wide range of biological processes. Accurately identifying 6mA sites on a genomic scale is crucial for understanding 6mA’s biological functions. However, the existing experimental techniques for detecting 6mA sites are not cost-effective, which creates a strong need for new computational methods for this problem. In this paper, we developed, without requiring any prior knowledge of 6mA or manually crafted sequence features, a deep learning framework named Deep6mA to identify DNA 6mA sites, and its performance is superior to other DNA 6mA prediction tools. Specifically, 5-fold cross-validation on a benchmark dataset of rice gives the sensitivity and specificity of Deep6mA as 92.96% and 95.06%, respectively, and the overall prediction accuracy is 94%. Importantly, we find that the sequences with 6mA sites share similar patterns across different species. The model trained with rice data predicts well the 6mA sites of three other species, Arabidopsis thaliana, Fragaria vesca and Rosa chinensis, with a prediction accuracy over 90%. In addition, we find that (1) 6mA tends to occur at GAGG motifs, which means the sequence near the 6mA site may be conserved; (2) 6mA is enriched in the TATA box of the promoter, which may be the main source of its regulation of downstream gene expression.
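
The three reported metrics are related through the confusion matrix, as the short sketch below shows. The counts used are hypothetical and are not taken from the Deep6mA benchmark.

```python
def binary_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # fraction of true sites that are recovered
    specificity = tn / (tn + fp)  # fraction of non-sites that are rejected
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Hypothetical counts for a balanced test set of 100 positives and 100 negatives.
sens, spec, acc = binary_metrics(tp=90, fn=10, tn=95, fp=5)
```

When positives and negatives are equally numerous, accuracy is the mean of sensitivity and specificity. Assuming a balanced benchmark, the figures quoted above are consistent with this: (92.96% + 95.06%) / 2 is about 94.0%.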


Bipartite graph-based collaborative matrix factorization method for predicting miRNA-disease associations

BMC Bioinformatics ◽

10.1186/s12859-021-04486-w ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Feng Zhou ◽

Meng-Meng Yin ◽

Cui-Na Jiao ◽

Zhen Cui ◽

Jing-Xiu Zhao ◽

...

Keyword(s):

Bipartite Graph ◽

Matrix Factorization ◽

Cross Validation ◽

Rapid Development ◽

Factorization Method ◽

Computational Method ◽

Human Diseases ◽

Simulation Experiments ◽

Disease Associations ◽

Fold Cross Validation

Abstract Background With the rapid development of various advanced biotechnologies, researchers in related fields have realized that microRNAs (miRNAs) play critical roles in many serious human diseases. However, experimental identification of new miRNA–disease associations (MDAs) is expensive and time-consuming. Practitioners have shown growing interest in methods for predicting potential MDAs. In recent years, an increasing number of computational methods for predicting novel MDAs have been developed, making a huge contribution to the research of human diseases and saving considerable time. In this paper, we propose an efficient computational method, named bipartite graph-based collaborative matrix factorization (BGCMF), which is highly advantageous for predicting novel MDAs. Results By combining two improved recommendation methods, a new model for predicting MDAs is generated. Based on the idea that some new miRNAs and diseases do not have any associations, we adopt the bipartite graph based on the collaborative matrix factorization method to complete the prediction. BGCMF achieves a desirable result, with an AUC of up to 0.9514 ± 0.0007 in the five-fold cross-validation experiments. Conclusions Five-fold cross-validation is used to evaluate the capabilities of our method. Simulation experiments are implemented to predict new MDAs. More importantly, the AUC value of our method is higher than those of some state-of-the-art methods. Finally, many associations between new miRNAs and new diseases are successfully predicted by performing simulation experiments, indicating that BGCMF is a useful method to predict more potential miRNAs with roles in various diseases.


DeePhage: distinguishing virulent and temperate phage-derived sequences in metavirome data with a deep learning approach

GigaScience ◽

10.1093/gigascience/giab056 ◽

2021 ◽

Vol 10 (9) ◽

Cited By ~ 1

Author(s):

Shufang Wu ◽

Zhencheng Fang ◽

Jie Tan ◽

Mo Li ◽

Chunhui Wang ◽

...

Keyword(s):

DNA Sequences ◽

Cross Validation ◽

Direct Detection ◽

Temperate Phage ◽

Computational Method ◽

New Strategy ◽

Culture Independent ◽

Fold Cross Validation ◽

Insight Into ◽

Metagenomics Analysis

Abstract Background Prokaryotic viruses referred to as phages can be divided into virulent and temperate phages. Distinguishing virulent and temperate phage–derived sequences in metavirome data is important for elucidating their different roles in interactions with bacterial hosts and regulation of microbial communities. However, there is no experimental or computational approach to effectively classify their sequences in culture-independent metavirome data. We present a new computational method, DeePhage, which can directly and rapidly judge each read or contig as a virulent or temperate phage–derived fragment. Findings DeePhage uses a “one-hot” encoding form to represent DNA sequences in detail. Sequence signatures are detected via a convolutional neural network to obtain valuable local features. The accuracy of DeePhage on 5-fold cross-validation reaches as high as 89%, nearly 10% and 30% higher than that of 2 similar tools, PhagePred and PHACTS. On real metavirome data, DeePhage correctly predicts the highest proportion of contigs when using BLAST as annotation, without apparent preferences. In addition, DeePhage reduces running time vs PhagePred and PHACTS by 245 and 810 times, respectively, under the same computational configuration. By direct detection of temperate viral fragments from metagenome and metavirome data, we furthermore propose a new strategy to explore phage transformations in the microbial community. The ability to detect such transformations provides new insight into potential treatments for human disease. Conclusions DeePhage is a novel tool developed to rapidly and efficiently identify 2 kinds of phage fragments, especially for metagenomics analysis. DeePhage is freely available via http://cqb.pku.edu.cn/ZhuLab/DeePhage or https://github.com/shufangwu/DeePhage.
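
The “one-hot” representation mentioned in the Findings can be sketched as follows. This is an illustration of the general encoding, not DeePhage's own code; the A, C, G, T column order and the choice to map ambiguous bases such as N to an all-zero row are assumptions.

```python
ALPHABET = "ACGT"  # assumed column order


def one_hot_encode(sequence):
    """Encode a DNA string as a list of 4-element 0/1 rows, one row per base."""
    index = {base: i for i, base in enumerate(ALPHABET)}
    encoded = []
    for base in sequence.upper():
        row = [0] * len(ALPHABET)
        if base in index:
            row[index[base]] = 1  # unknown bases keep an all-zero row
        encoded.append(row)
    return encoded


encoded = one_hot_encode("ACGN")
for row in encoded:
    print(row)
```

The resulting matrix has one row per position and one column per nucleotide, which is the kind of input a convolutional layer can scan for local sequence signatures.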


Mal-Prec: computational prediction of protein Malonylation sites via machine learning based feature integration

BMC Genomics ◽

10.1186/s12864-020-07166-w ◽

2020 ◽

Vol 21 (1) ◽

Author(s):

Xin Liu ◽

Liang Wang ◽

Jian Li ◽

Junfeng Hu ◽

Xiao Zhang

Keyword(s):

Cross Validation ◽

Chemical Properties ◽

Principal Component ◽

Computational Prediction ◽

Computational Method ◽

Support Vector ◽

Data Sets ◽

Post Translational Modification ◽

Human Proteins ◽

Fold Cross Validation

Abstract Background Malonylation is a recently discovered post-translational modification that is associated with a variety of diseases such as Type 2 Diabetes Mellitus and different types of cancers. Compared with experimental identification of malonylation sites, computational methods are time-effective and comparatively low in cost. Results In this study, we proposed a novel computational model called Mal-Prec (Malonylation Prediction) for malonylation site prediction through the combination of Principal Component Analysis (PCA) and Support Vector Machine (SVM). One-hot encoding, physicochemical properties, and composition of k-spaced amino acid pairs were first used to extract sequence features. PCA was then applied to select optimal feature subsets while SVM was adopted to predict malonylation sites. Five-fold cross-validation results showed that Mal-Prec can achieve better prediction performance compared with other approaches. AUC (area under the receiver operating characteristic curve) analysis achieved 96.47% and 90.72% on 5-fold cross-validation of independent data sets, respectively. Conclusion Mal-Prec is a computationally reliable method for identifying malonylation sites in protein sequences. It outperforms existing prediction tools and can serve as a useful tool for identifying and discovering novel malonylation sites in human proteins. Mal-Prec is coded in MATLAB and is publicly available at https://github.com/flyinsky6/Mal-Prec, together with the data sets used in this study.
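
One of the feature types named above, the composition of k-spaced amino acid pairs, can be sketched in a few lines. Mal-Prec itself is written in MATLAB, so this Python version is only an illustration of the general idea; the 20-letter alphabet order, the normalisation by the number of counted pairs, and the example peptide are assumptions.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, assumed order


def cksaap(sequence, k):
    """Return 400 frequencies of residue pairs separated by exactly k residues."""
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    total = 0
    for i in range(len(sequence) - k - 1):
        pair = sequence[i] + sequence[i + k + 1]
        if pair in counts:  # skip pairs containing non-standard symbols
            counts[pair] += 1
            total += 1
    return [counts[p] / total if total else 0.0 for p in pairs]


# Hypothetical peptide fragment; k = 0 counts directly adjacent residues.
features = cksaap("ACAC", k=0)
```

Concatenating the vectors for several values of k gives a fixed-length feature vector regardless of the residues present, which is what makes a subsequent dimensionality-reduction step such as PCA useful.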


SNF-NN: computational method to predict drug-disease interactions using similarity network fusion and neural networks

BMC Bioinformatics ◽

10.1186/s12859-020-03950-3 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Tamer N. Jarada ◽

Jon G. Rokne ◽

Reda Alhajj

Keyword(s):

Machine Learning ◽

Deep Learning ◽

Cross Validation ◽

Drug Repositioning ◽

Computational Method ◽

Machine Learning Techniques ◽

Similarity Network ◽

Novel Drug ◽

Similarity Information ◽

Fold Cross Validation

Abstract Background Drug repositioning is an emerging approach in pharmaceutical research for identifying novel therapeutic potentials for approved drugs and discovering therapies for untreated diseases. Due to its time and cost efficiency, drug repositioning plays an instrumental role in optimizing the drug development process compared to the traditional de novo drug discovery process. Advances in genomics, together with the enormous growth of large-scale publicly available data and the availability of high-performance computing capabilities, have further motivated the development of computational drug repositioning approaches. More recently, the rise of machine learning techniques, together with the availability of powerful computers, has made computational drug repositioning an area of intense activity. Results In this study, a novel framework SNF-NN based on deep learning is presented, where novel drug-disease interactions are predicted using drug-related similarity information, disease-related similarity information, and known drug-disease interactions. Heterogeneous similarity information related to drugs and diseases is fed to the proposed framework in order to predict novel drug-disease interactions. SNF-NN uses similarity selection, similarity network fusion, and a highly tuned novel neural network model to predict new drug-disease interactions. The robustness of SNF-NN is evaluated by comparing its performance with nine baseline machine learning methods. The proposed framework outperforms all baseline methods (AUC-ROC = 0.867 and AUC-PR = 0.876) using stratified 10-fold cross-validation. To further demonstrate the reliability and robustness of SNF-NN, two datasets are used to fairly validate the proposed framework’s performance against seven recent state-of-the-art methods for drug-disease interaction prediction. SNF-NN achieves remarkable performance in stratified 10-fold cross-validation with AUC-ROC ranging from 0.879 to 0.931 and AUC-PR from 0.856 to 0.903. Moreover, the efficiency of SNF-NN is verified by validating predicted unknown drug-disease interactions against clinical trials and published studies. Conclusion In conclusion, computational drug repositioning research can significantly benefit from integrating similarity measures in heterogeneous networks and deep learning models for predicting novel drug-disease interactions. The data and implementation of SNF-NN are available at http://pages.cpsc.ucalgary.ca/ tnjarada/snf-nn.php.


SNF-NN: Computational Method To Predict Drug-Disease Interactions Using Similarity Network Fusion and Neural Networks

10.21203/rs.3.rs-56433/v1 ◽

2020 ◽

Author(s):

Tamer Jarada ◽

Jon Rokne ◽

Reda Alhajj

Keyword(s):

Machine Learning ◽

Deep Learning ◽

Cross Validation ◽

Drug Repositioning ◽

Computational Method ◽

Machine Learning Techniques ◽

Similarity Network ◽

Novel Drug ◽

Similarity Information ◽

Fold Cross Validation

Abstract Drug repositioning is an emerging approach in pharmaceutical research for identifying novel therapeutic potentials for approved drugs and discovering therapies for untreated diseases. Due to its time and cost efficiency, drug repositioning plays an instrumental role in optimizing the drug development process compared to the traditional de novo drug discovery process. Advances in genomics, together with the enormous growth of large-scale publicly available data and the availability of high-performance computing capabilities, have further motivated the development of computational drug repositioning approaches. More recently, the rise of machine learning techniques, together with the availability of powerful computers, has made computational drug repositioning an area of intense activity. In this study, a novel framework SNF-NN based on deep learning is presented, where novel drug-disease interactions are predicted using drug-related similarity information, disease-related similarity information, and known drug-disease interactions. Heterogeneous similarity information related to drugs and diseases is fed to the proposed framework in order to predict novel drug-disease interactions. SNF-NN uses similarity selection, similarity network fusion, and a highly tuned novel neural network model to predict new drug-disease interactions. The robustness of SNF-NN is evaluated by comparing its performance with nine baseline machine learning methods. The proposed framework outperforms all baseline methods (AUC-ROC = 0.867 and AUC-PR = 0.876) using stratified 10-fold cross-validation. To further demonstrate the reliability and robustness of SNF-NN, two datasets are used to fairly validate the proposed framework’s performance against seven recent state-of-the-art methods for drug-disease interaction prediction. SNF-NN achieves remarkable performance in stratified 10-fold cross-validation with AUC-ROC ranging from 0.879 to 0.931 and AUC-PR from 0.856 to 0.903. Moreover, the efficiency of SNF-NN is verified by validating predicted unknown drug-disease interactions against clinical trials and published studies. In conclusion, computational drug repositioning research can significantly benefit from integrating similarity measures in heterogeneous networks and deep learning models for predicting novel drug-disease interactions.


Deep6mA: a deep learning framework for exploring similar patterns in DNA N6-methyladenine sites across different species

10.1101/2019.12.28.889824 ◽

2019 ◽

Author(s):

Zutan Li ◽

Hangjin Jiang ◽

Lingpeng Kong ◽

Yuanyuan Chen ◽

Liangyun Zhang ◽

...

Keyword(s):

Deep Learning ◽

Prediction Accuracy ◽

Cross Validation ◽

Biological Processes ◽

Biological Functions ◽

Learning Framework ◽

Wide Range ◽

Genomic Scale ◽

Downstream Gene Expression ◽

Fold Cross Validation

Abstract N6-methyladenine (6mA) is an important DNA modification associated with a wide range of biological processes. Accurately identifying 6mA sites on a genomic scale is crucial for understanding 6mA’s biological functions. In this paper, we developed, without requiring any prior knowledge of 6mA or manually crafted sequence features, a deep learning framework named Deep6mA to identify DNA 6mA sites, and its performance is superior to other DNA 6mA prediction tools. Specifically, 5-fold cross-validation on a benchmark dataset of rice gives the sensitivity and specificity of Deep6mA as 92.96% and 95.06%, respectively, and the overall prediction accuracy is 94%. Importantly, we find that the sequences with 6mA sites share similar patterns across different species. The model trained with rice data predicts well the 6mA sites of three other species, Arabidopsis thaliana, Fragaria vesca, and Rosa chinensis, with a prediction accuracy over 90%. In addition, we find that (1) 6mA tends to occur at GAGG motifs, which means the sequence near the 6mA site may be conserved; (2) 6mA is enriched in the TATA box of the promoter, which may be the main source of its regulation of downstream gene expression.


A Novel Approach Based on Point Cut Set to Predict Associations of Diseases and LncRNAs

Current Bioinformatics ◽

10.2174/1574893613666181026122045 ◽

2019 ◽

Vol 14 (4) ◽

pp. 333-343 ◽

Cited By ~ 3

Author(s):

Linai Kuang ◽

Haochen Zhao ◽

Lei Wang ◽

Zhanwei Xuan ◽

Tingrui Pei

Keyword(s):

Cross Validation ◽

State Of The Art ◽

Interaction Network ◽

Research Field ◽

Computational Method ◽

Difference Matrix ◽

Art Methods ◽

Disease Associations ◽

Cut Set ◽

Fold Cross Validation

Background: In recent years, growing evidence has indicated that long non-coding RNAs (lncRNAs) play vital roles in a wide range of human diseases and can serve as potential biomarkers and drug targets. Compared with the vast number of lncRNAs being discovered, the relationships between lncRNAs and diseases remain largely unknown. Objective: The prediction of novel and potential associations between lncRNAs and diseases would help dissect the complex mechanisms of disease pathogenesis. Method: In this paper, a new computational method based on Point Cut Set is proposed to predict LncRNA-Disease Associations (PCSLDA) based on known lncRNA-disease associations. Compared with the existing state-of-the-art methods, the major novelty of PCSLDA lies in the incorporation of a distance difference matrix and a point cut set to set the distance correlation coefficient of nodes in the lncRNA-disease interaction network. Hence, PCSLDA can be applied to forecast potential lncRNA-disease associations while requiring only known disease-lncRNA associations. Results: Simulation results show that PCSLDA can significantly outperform previous state-of-the-art methods, with a reliable AUC of 0.8902 in leave-one-out cross-validation and AUCs of 0.7634 and 0.8317 in 5-fold and 10-fold cross-validation, respectively. Additionally, 70% of the top 10 predicted cancer-lncRNA associations can be confirmed. Conclusion: It is anticipated that our proposed model can be a great addition to the biomedical research field.
