BERT-m7G: A Transformer Architecture Based on BERT and Stacking Ensemble to Identify RNA N7-Methylguanosine Sites from Sequence Information

As one of the most prevalent posttranscriptional modifications of RNA, N7-methylguanosine (m7G) plays an essential role in the regulation of gene expression. Accurate identification of m7G sites in the transcriptome is invaluable for revealing their potential functional mechanisms. Although high-throughput experimental methods can locate m7G sites precisely, they are expensive and time-consuming. Hence, it is imperative to design an efficient computational method that can accurately identify m7G sites. In this study, we propose a novel method that brings a BERT-based multilingual model into bioinformatics to represent the information in RNA sequences. Firstly, we treat RNA sequences as natural sentences and then employ the bidirectional encoder representations from transformers (BERT) model to transform them into fixed-length numerical matrices. Secondly, a feature selection scheme based on the elastic net method is constructed to eliminate redundant features and retain important ones. Finally, the selected feature subset is input into a stacking ensemble classifier to predict m7G sites, and the hyperparameters of the classifier are tuned with the tree-structured Parzen estimator (TPE) approach. Under 10-fold cross-validation, BERT-m7G achieves an ACC of 95.48% and an MCC of 0.9100. The experimental results indicate that the proposed method significantly outperforms state-of-the-art prediction methods in the identification of m7G modifications.
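
The two reported metrics, accuracy (ACC) and the Matthews correlation coefficient (MCC), can both be computed from the four cells of a binary confusion matrix. The sketch below is illustrative only: the counts are hypothetical and are not taken from the paper, and returning 0.0 when the MCC denominator vanishes is a convention assumed here.

```python
import math

def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention for the undefined case
    return (tp * tn - fp * fn) / denom

# Hypothetical confusion-matrix counts, not results from the paper.
tp, tn, fp, fn = 90, 85, 15, 10
print(f"ACC = {accuracy(tp, tn, fp, fn):.4f}")
print(f"MCC = {mcc(tp, tn, fp, fn):.4f}")
```

Unlike ACC, MCC uses all four cells symmetrically, which is why it is often reported alongside ACC for site-prediction tasks.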


DNN-m6A: A Cross-Species Method for Identifying RNA N6-Methyladenosine Sites Based on Deep Neural Network with Multi-Information Fusion

Genes ◽

10.3390/genes12030354 ◽

2021 ◽

Vol 12 (3) ◽

pp. 354

Author(s):

Lu Zhang ◽

Xinyi Qin ◽

Min Liu ◽

Ziwei Xu ◽

Guangzhong Liu

Keyword(s):

Neural Network ◽

Deep Neural Network ◽

Area Under The Curve ◽

Nucleotide Composition ◽

Computational Method ◽

Feature Subset ◽

Accurate Identification ◽

Genome Wide ◽

Dinucleotide Composition ◽

Optimal Feature Subset

As a prevalent post-transcriptional modification of RNA, N6-methyladenosine (m6A) plays a crucial role in various biological processes. To better reveal its regulatory mechanism and provide new insights for drug design, accurate genome-wide identification of m6A sites is vital. As the traditional experimental methods are time-consuming and cost-prohibitive, it is necessary to design a more efficient computational method to detect m6A sites. In this study, we propose a novel cross-species computational method, DNN-m6A, based on a deep neural network (DNN) to identify m6A sites in multiple tissues of human, mouse and rat. Firstly, binary encoding (BE), tri-nucleotide composition (TNC), enhanced nucleic acid composition (ENAC), K-spaced nucleotide pair frequencies (KSNPFs), nucleotide chemical property (NCP), pseudo dinucleotide composition (PseDNC), position-specific nucleotide propensity (PSNP) and position-specific dinucleotide propensity (PSDP) are employed to extract RNA sequence features, which are subsequently fused to construct the initial feature vector set. Secondly, we use elastic net to eliminate redundant features while building the optimal feature subset. Finally, the hyper-parameters of the DNN are tuned with Bayesian hyper-parameter optimization based on the selected feature subset. The five-fold cross-validation test on the training datasets shows that the proposed DNN-m6A method outperforms the state-of-the-art method for predicting m6A sites, with an accuracy (ACC) of 73.58%–83.38% and an area under the curve (AUC) of 81.39%–91.04%. Furthermore, on the independent datasets the method achieves an ACC of 72.95%–83.04% and an AUC of 80.79%–91.09%, which shows the excellent generalization ability of our proposed method.
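
To make the feature-fusion step concrete, the following minimal sketch computes two of the listed encodings, binary encoding (BE) and tri-nucleotide composition (TNC), and fuses them by concatenation. The one-hot mapping, the function names and the sample sequence are assumptions made for illustration; the paper's exact encodings and fusion order may differ.

```python
from itertools import product

NUCLEOTIDES = "ACGU"
# All 64 trinucleotides in a fixed order, so every vector has the same layout.
TRINUCLEOTIDES = ["".join(p) for p in product(NUCLEOTIDES, repeat=3)]

def binary_encoding(seq):
    """One-hot encode each base (assumed mapping: A=1000, C=0100, G=0010, U=0001)."""
    vec = []
    for base in seq:
        vec.extend(1.0 if base == n else 0.0 for n in NUCLEOTIDES)
    return vec

def trinucleotide_composition(seq):
    """Relative frequency of each of the 64 overlapping 3-mers."""
    total = len(seq) - 2
    if total <= 0:
        return [0.0] * len(TRINUCLEOTIDES)
    counts = dict.fromkeys(TRINUCLEOTIDES, 0)
    for i in range(total):
        kmer = seq[i:i + 3]
        if kmer in counts:  # skip 3-mers containing ambiguous bases
            counts[kmer] += 1
    return [counts[k] / total for k in TRINUCLEOTIDES]

def fuse_features(seq):
    """Concatenate BE and TNC into one initial feature vector."""
    seq = seq.upper().replace("T", "U")
    return binary_encoding(seq) + trinucleotide_composition(seq)

sample = "GGACUAGCUAACGGACUUAGC"  # hypothetical 21-nt window
features = fuse_features(sample)
print(len(features))  # 4 values per position plus 64 TNC frequencies
```

The fused vector grows quickly as more encodings are added, which is the motivation for the elastic net selection step that follows in the abstract.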


m5CPred-SVM: A Novel Method for Predicting m5C Sites of RNA

10.21203/rs.3.rs-39526/v2 ◽

2020 ◽

Author(s):

Xiao Chen ◽

Yi Xiong ◽

Yinbo Liu ◽

Yuqing Chen ◽

Shoudong Bi ◽

...

Keyword(s):

Cell Fate ◽

Prediction Accuracy ◽

Cytosine Methylation ◽

Low Cost ◽

Computational Method ◽

Selection Strategy ◽

Support Vector ◽

Feature Subset ◽

Biological Functions ◽

Accurate Identification

Abstract Background: As one of the most common post-transcriptional modifications (PTCM) in RNA, 5-cytosine-methylation plays important roles in many biological functions such as RNA metabolism and cell fate decision. Through accurate identification of 5-methylcytosine (m5C) sites on RNA, researchers can better understand the exact role of 5-cytosine-methylation in these biological functions. In recent years, computational methods for predicting m5C sites have attracted much interest because of their efficiency and low cost. However, both the accuracy and the efficiency of these methods are not yet satisfactory and need further improvement. Results: In this work, we have developed a new computational method, m5CPred-SVM, to identify m5C sites in three species, H. sapiens, M. musculus and A. thaliana. To build this model, we first collected benchmark datasets following three recently published methods. Then, six types of sequence-based features were generated based on RNA segments, and the sequential forward feature selection strategy was used to obtain the optimal feature subset. After that, the performance of models based on different learning algorithms was compared, and the model based on the support vector machine provided the highest prediction accuracy. Finally, our proposed method, m5CPred-SVM, was compared with several existing methods, and the result showed that m5CPred-SVM offered substantially higher prediction accuracy than previously published methods. It is expected that our method, m5CPred-SVM, can become a useful tool for accurate identification of m5C sites. Conclusion: In this study, by introducing position-specific propensity related features, we built a new model, m5CPred-SVM, to predict RNA m5C sites of three different species. The result shows that our model outperformed the existing state-of-the-art models. Our model is available for users through a web server at http://zhulab.ahu.edu.cn/m5CPred-SVM.
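
The sequential forward feature selection strategy mentioned above can be sketched as a greedy loop: start from an empty subset and repeatedly add the single feature that most improves a score, stopping when no candidate helps. The scoring function here is a hypothetical stand-in; in practice it would be something like the cross-validated accuracy of an SVM trained on the candidate subset, and the authors' stopping rule may differ.

```python
def sequential_forward_selection(n_features, score_fn, max_features=None):
    """Greedily grow a feature subset.

    score_fn takes a list of feature indices and returns a number where
    higher is better. Selection stops when no remaining feature improves
    the best score seen so far.
    """
    selected = []
    best_score = float("-inf")
    remaining = set(range(n_features))
    limit = n_features if max_features is None else max_features
    while remaining and len(selected) < limit:
        # Score every one-feature extension of the current subset.
        candidate_scores = {f: score_fn(selected + [f]) for f in sorted(remaining)}
        best_f = max(candidate_scores, key=candidate_scores.get)
        if candidate_scores[best_f] <= best_score:
            break
        selected.append(best_f)
        remaining.remove(best_f)
        best_score = candidate_scores[best_f]
    return selected, best_score

# Toy scoring function (hypothetical): features 1 and 3 are informative,
# and every added feature costs a small penalty.
GAIN = {0: 0.00, 1: 0.30, 2: 0.01, 3: 0.20}

def toy_score(subset):
    return sum(GAIN[f] for f in subset) - 0.05 * len(subset)

subset, score = sequential_forward_selection(4, toy_score)
print(subset, round(score, 2))
```

Because each round re-scores every remaining feature, the cost grows roughly quadratically with the number of features, which is one reason this strategy is usually applied after an initial feature-generation step rather than on raw inputs.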


m5CPred-SVM: A Novel Method for Predicting m5C Sites of RNA

10.21203/rs.3.rs-39526/v3 ◽

2020 ◽

Author(s):

Xiao Chen ◽

Yi Xiong ◽

Yinbo Liu ◽

Yuqing Chen ◽

Shoudong Bi ◽

...

Keyword(s):

Cell Fate ◽

Prediction Accuracy ◽

Cytosine Methylation ◽

Low Cost ◽

Computational Method ◽

Selection Strategy ◽

Support Vector ◽

Feature Subset ◽

Biological Functions ◽

Accurate Identification


PredPSD: A Gradient Tree Boosting Approach for Single-Stranded and Double-Stranded DNA Binding Protein Prediction

Molecules ◽

10.3390/molecules25010098 ◽

2019 ◽

Vol 25 (1) ◽

pp. 98 ◽

Cited By ~ 1

Author(s):

Changgeng Tan ◽

Tong Wang ◽

Wenyi Yang ◽

Lei Deng

Keyword(s):

Dna Binding ◽

Binding Proteins ◽

Dna Binding Proteins ◽

Computational Method ◽

Sequence Information ◽

Feature Subset ◽

Single Stranded Dna ◽

Cell Functions ◽

Double Stranded Dna ◽

Maximal Relevance

Interactions between proteins and DNA play essential roles in many biological processes. DNA-binding proteins can be classified into two categories. Double-stranded DNA-binding proteins (DSBs) bind to double-stranded DNA and are involved in a series of cell functions such as gene expression and regulation. Single-stranded DNA-binding proteins (SSBs) bind to single-stranded DNA and are necessary for DNA replication, recombination, and repair. Therefore, the effective classification of DNA-binding proteins is helpful for the functional annotation of proteins. In this work, we propose PredPSD, a computational method based on sequence information that accurately predicts SSBs and DSBs. It introduces three novel feature extraction algorithms. In particular, we use the autocross-covariance (ACC) transformation to transform feature matrices into fixed-length vectors. Then, we feed the optimal feature subset obtained by the minimal-redundancy-maximal-relevance criterion (mRMR) feature selection algorithm into a gradient tree boosting (GTB) classifier. In 10-fold cross-validation on a benchmark dataset, PredPSD achieves promising performance with an AUC score of 0.956 and an accuracy of 0.912, which are better than those of existing methods. Moreover, our method significantly improves the prediction accuracy in independent testing. The experimental results show that PredPSD can effectively recognize the binding specificity and differentiate DSBs from SSBs.
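
The role of the autocross-covariance (ACC) transformation is to map a per-residue feature matrix, whose number of rows depends on the protein length, to a vector whose length does not. The sketch below uses one common formulation (auto covariance of each column with itself and cross covariance between column pairs, at lags 1 to `max_lag`); the function name, the normalisation and the toy matrices are assumptions, not necessarily the exact variant used in PredPSD.

```python
def acc_transform(matrix, max_lag):
    """Auto-cross covariance of an L x N feature matrix.

    matrix: list of L rows (sequence positions), each with N columns.
    Returns a vector of length N * N * max_lag, independent of L.
    """
    length = len(matrix)
    n_cols = len(matrix[0])
    if max_lag >= length:
        raise ValueError("max_lag must be smaller than the sequence length")
    # Mean of every column over all positions.
    means = [sum(row[c] for row in matrix) / length for c in range(n_cols)]
    features = []
    for lag in range(1, max_lag + 1):
        for c1 in range(n_cols):
            for c2 in range(n_cols):
                # c1 == c2 is the auto covariance, c1 != c2 the cross covariance.
                total = sum(
                    (matrix[j][c1] - means[c1]) * (matrix[j + lag][c2] - means[c2])
                    for j in range(length - lag)
                )
                features.append(total / (length - lag))
    return features

# Two hypothetical feature matrices of different lengths (4 and 6 positions).
short = [[1, 0, 2], [0, 1, 1], [2, 1, 0], [1, 1, 1]]
longer = short + [[0, 2, 1], [1, 0, 0]]
print(len(acc_transform(short, 2)), len(acc_transform(longer, 2)))
```

Both calls return vectors of the same length, which is what allows proteins of different sizes to share one classifier input layout.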


StackRAM: a cross-species method for identifying RNA N6-methyladenosine sites based on stacked ensemble

10.1101/2020.04.23.058651 ◽

2020 ◽

Author(s):

Zhaomin Yu ◽

Baoguang Tian ◽

Yaning Liu ◽

Yaqun Zhang ◽

Qin Ma ◽

...

Keyword(s):

Feature Fusion ◽

Elastic Net ◽

Machine Learning Algorithms ◽

Computational Method ◽

Training Dataset ◽

Feature Subset ◽

Accurate Identification ◽

Jackknife Test ◽

Nucleotide Frequency ◽

Noisy Information

Abstract: N6-methyladenosine is a prevalent RNA methylation modification, which plays an important role in various biological processes. Accurate identification of the m6A sites is fundamental to a deep understanding of the biological functions and mechanisms of the modification. However, the experimental methods for detecting m6A sites are usually time-consuming and expensive, and various computational methods have been developed to identify m6A sites in RNA. This paper proposes a novel cross-species computational method, StackRAM, which uses machine learning algorithms to identify the m6A sites in S. cerevisiae, H. sapiens and A. thaliana. First, the RNA sequence features are extracted through binary encoding, chemical property, nucleotide frequency, k-mer nucleotide frequency, pseudo dinucleotide composition, and position-specific trinucleotide propensity, and the initial feature set is obtained by feature fusion. Secondly, the Elastic Net is used for the first time to filter redundant and noisy information and retain important features for m6A site classification. Finally, the output probabilities of the base classifiers are combined with the optimal feature subset selected by the Elastic Net, and the combined features are input into the second-stage meta-classifier, an SVM. The jackknife test on the S. cerevisiae training dataset indicates that the prediction performance of StackRAM is superior to the current state-of-the-art methods. The prediction accuracy of StackRAM on the independent test datasets H. sapiens and A. thaliana reaches 92.30% and 87.06%, respectively. Therefore, StackRAM has development potential in cross-species prediction and can be a useful method for identifying m6A sites. The source code and all datasets are available at https://github.com/QUST-AIBBDRC/StackRAM/.
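
The final step, combining base-classifier output probabilities with the selected features before the meta-classifier, amounts to building a longer input vector. The sketch below shows only that construction. The two base classifiers are hypothetical fixed logistic models with made-up weights, standing in for whatever trained first-stage models are used, and the SVM itself is omitted.

```python
import math

def logistic_stub(weights, bias):
    """Return a toy base classifier: a fixed logistic model (hypothetical weights)."""
    def predict_proba(x):
        z = bias + sum(w * v for w, v in zip(weights, x))
        return 1.0 / (1.0 + math.exp(-z))
    return predict_proba

def build_meta_features(selected_features, base_classifiers):
    """Append each base classifier's output probability to the selected features."""
    probabilities = [clf(selected_features) for clf in base_classifiers]
    return list(selected_features) + probabilities

# Hypothetical first-stage models and a hypothetical 3-feature sample.
base_classifiers = [
    logistic_stub([2.0, -1.0, 0.5], 0.1),
    logistic_stub([-0.5, 1.5, 1.0], -0.2),
]
selected = [0.8, 0.1, 0.4]
meta_input = build_meta_features(selected, base_classifiers)
print(len(meta_input))  # 3 selected features + 2 probabilities
```

When such a stack is trained, the probabilities for the training samples are normally produced out-of-fold, so that the meta-classifier does not see predictions made by a model that was fitted on the same sample.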


Predicting LncRNA Subcellular Localization Using Unbalanced Pseudo-k Nucleotide Compositions

Current Bioinformatics ◽

10.2174/1574893614666190902151038 ◽

2020 ◽

Vol 15 (6) ◽

pp. 554-562

Author(s):

Xiao-Fei Yang ◽

Yuan-Ke Zhou ◽

Lin Zhang ◽

Yang Gao ◽

Pu-Feng Du

Keyword(s):

Subcellular Localization ◽

Regulation Of Gene Expression ◽

Nucleotide Composition ◽

Support Vector ◽

Sequence Information ◽

Feature Subset ◽

Feature Selection Technique ◽

Optimal Feature Subset ◽

Leave One Out ◽

Correlated Factors

Background: Long non-coding RNAs (lncRNAs) are transcripts longer than 200 nucleotides that function in the regulation of gene expression. Growing evidence has shown that the biological functions of lncRNAs are intimately related to their subcellular localizations. Therefore, it is very important to determine the subcellular localization of lncRNAs. Methods: In this paper, we proposed a novel method to predict the subcellular localization of lncRNAs. To utilize lncRNA sequence information more comprehensively, we exploited both k-mer nucleotide composition and sequence-order correlated factors to formulate lncRNA sequences. Meanwhile, a feature selection technique based on the Analysis Of Variance (ANOVA) was applied to obtain the optimal feature subset. Finally, we used the support vector machine (SVM) to perform the prediction. Results: The AUC value of the proposed method can reach 0.9695, which indicates that the proposed predictor is an efficient and reliable tool for determining lncRNA subcellular localization. Furthermore, the predictor can reach a maximum overall accuracy of 90.37% in leave-one-out cross-validation, which clearly outperforms the existing state-of-the-art method. Conclusion: It is demonstrated that the proposed predictor is feasible and powerful for the prediction of lncRNA subcellular localization. To facilitate subsequent genetic sequence research, we shared the source code at https://github.com/NicoleYXF/lncRNA.
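
ANOVA-based feature selection ranks each feature by the one-way F statistic, the ratio of between-class to within-class variance, and keeps the highest-scoring ones. A minimal sketch follows; the toy data, the function names and the choice to keep a fixed `top_k` are assumptions, since the abstract does not say how the cut-off was chosen.

```python
def anova_f_score(values, labels):
    """One-way ANOVA F statistic of one feature across the label groups."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, k = len(values), len(groups)
    if k < 2 or n <= k:
        raise ValueError("need at least two classes and more samples than classes")
    grand_mean = sum(values) / n
    group_means = {y: sum(g) / len(g) for y, g in groups.items()}
    between = sum(len(g) * (group_means[y] - grand_mean) ** 2 for y, g in groups.items())
    within = sum((v - group_means[y]) ** 2 for y, g in groups.items() for v in g)
    ms_between = between / (k - 1)
    ms_within = within / (n - k)
    if ms_within == 0:
        return float("inf") if ms_between > 0 else 0.0
    return ms_between / ms_within

def select_top_features(X, labels, top_k):
    """Return the indices of the top_k features by F score, plus all scores."""
    n_features = len(X[0])
    scores = [anova_f_score([row[f] for row in X], labels) for f in range(n_features)]
    ranked = sorted(range(n_features), key=lambda f: scores[f], reverse=True)
    return ranked[:top_k], scores

# Hypothetical data: feature 0 separates the classes, feature 1 does not.
X = [
    [0.1, 5.0, 1.0], [0.2, 4.0, 2.0], [0.3, 6.0, 3.0],
    [0.9, 5.0, 1.5], [1.0, 4.0, 2.5], [1.1, 6.0, 3.5],
]
y = [0, 0, 0, 1, 1, 1]
top, scores = select_top_features(X, y, top_k=2)
print(top)
```

The selected columns would then be passed to the SVM; the F score is a per-feature filter, so it does not account for redundancy between features.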


m5CPred-SVM: a novel method for predicting m5C sites of RNA

BMC Bioinformatics ◽

10.1186/s12859-020-03828-4 ◽

2020 ◽

Vol 21 (1) ◽

Author(s):

Xiao Chen ◽

Yi Xiong ◽

Yinbo Liu ◽

Yuqing Chen ◽

Shoudong Bi ◽

...

Keyword(s):

Cell Fate ◽

Prediction Accuracy ◽

Cytosine Methylation ◽

Low Cost ◽

Computational Method ◽

Selection Strategy ◽

Support Vector ◽

Feature Subset ◽

Biological Functions ◽

Accurate Identification

Abstract Background: As one of the most common post-transcriptional modifications (PTCM) in RNA, 5-cytosine-methylation plays important roles in many biological functions such as RNA metabolism and cell fate decision. Through accurate identification of 5-methylcytosine (m5C) sites on RNA, researchers can better understand the exact role of 5-cytosine-methylation in these biological functions. In recent years, computational methods for predicting m5C sites have attracted much interest because of their efficiency and low cost. However, both the accuracy and the efficiency of these methods are not yet satisfactory and need further improvement. Results: In this work, we have developed a new computational method, m5CPred-SVM, to identify m5C sites in three species, H. sapiens, M. musculus and A. thaliana. To build this model, we first collected benchmark datasets following three recently published methods. Then, six types of sequence-based features were generated based on RNA segments, and the sequential forward feature selection strategy was used to obtain the optimal feature subset. After that, the performance of models based on different learning algorithms was compared, and the model based on the support vector machine provided the highest prediction accuracy. Finally, our proposed method, m5CPred-SVM, was compared with several existing methods, and the result showed that m5CPred-SVM offered substantially higher prediction accuracy than previously published methods. It is expected that our method, m5CPred-SVM, can become a useful tool for accurate identification of m5C sites. Conclusion: In this study, by introducing position-specific propensity related features, we built a new model, m5CPred-SVM, to predict RNA m5C sites of three different species. The result shows that our model outperformed the existing state-of-the-art models. Our model is available for users through a web server at https://zhulab.ahu.edu.cn/m5CPred-SVM.
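
Position-specific propensity features can be illustrated with the single-nucleotide case. Under one common formulation, assumed here rather than taken from the paper, each position of a sequence is encoded by how much more often its nucleotide occurs at that position in positive samples than in negative samples. The training sequences below are hypothetical and must be equal-length strings over A, C, G and U.

```python
def position_frequencies(sequences):
    """Per-position nucleotide frequencies for equal-length sequences."""
    length = len(sequences[0])
    counts = [dict.fromkeys("ACGU", 0) for _ in range(length)]
    for seq in sequences:
        for i, base in enumerate(seq):
            counts[i][base] += 1
    return [{b: c / len(sequences) for b, c in col.items()} for col in counts]

def psnp_encode(seq, pos_freqs, neg_freqs):
    """Encode each position as (frequency in positives) - (frequency in negatives)."""
    return [pos_freqs[i][b] - neg_freqs[i][b] for i, b in enumerate(seq)]

# Hypothetical training windows (3 nt each) for illustration only.
positives = ["ACG", "ACU"]
negatives = ["UCG", "UGG"]
pos_f = position_frequencies(positives)
neg_f = position_frequencies(negatives)
print(psnp_encode("ACG", pos_f, neg_f))
```

The frequency tables should be estimated from training data only; computing them on the full dataset before cross-validation would leak label information into the features.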


m5CPred-SVM: A Novel Method for Predicting m5C Sites of RNA

10.21203/rs.3.rs-39526/v1 ◽

2020 ◽

Author(s):

Xiao Chen ◽

Yi Xiong ◽

Yinbo Liu ◽

Yuqing Chen ◽

Shoudong Bi ◽

...

Keyword(s):

Cell Fate ◽

Prediction Accuracy ◽

Cytosine Methylation ◽

Computational Method ◽

Selection Strategy ◽

Support Vector ◽

Feature Subset ◽

Accurate Identification ◽

Benchmark Datasets ◽

Optimal Feature Subset

Abstract Background: As one of the most common post-transcriptional modifications (PTCM) in RNA, 5-cytosine-methylation plays important roles in many biological functions such as RNA metabolism and cell fate decision. Through accurate identification of 5-methylcytosine (m5C) sites on RNA, researchers can better understand the exact role of 5-cytosine-methylation in these biological functions. In recent years, computational methods for predicting m5C sites have attracted much interest because of their efficiency and low cost. However, both the accuracy and the efficiency of these methods are not yet satisfactory and need further improvement. Results: In this work, we have developed a new computational method, m5CPred-SVM, to identify m5C sites in three species, H. sapiens, M. musculus and A. thaliana. To build this model, we first collected benchmark datasets following three recently published methods. Then, six types of sequence-based features were generated based on RNA segments, and the sequential forward feature selection strategy was used to obtain the optimal feature subset. After that, the performance of models based on different learning algorithms was compared, and the model based on the support vector machine provided the highest prediction accuracy. Finally, our proposed method, m5CPred-SVM, was compared with several existing methods, and the result showed that m5CPred-SVM offered substantially higher prediction accuracy than previously published methods. It is expected that our method, m5CPred-SVM, can become a useful tool for accurate identification of m5C sites. Conclusion: In this study, by introducing position-specific propensity related features, we built a new model, m5CPred-SVM, to predict RNA m5C sites of three different species. The result shows that our model outperformed the existing state-of-the-art models. Our model is available for users through a web server at http://zhulab.ahu.edu.cn/m5CPred-SVM.


Identification of D Modification Sites by Integrating Heterogeneous Features in Saccharomyces cerevisiae

Molecules ◽

10.3390/molecules24030380 ◽

2019 ◽

Vol 24 (3) ◽

pp. 380 ◽

Cited By ~ 3

Author(s):

Pengmian Feng ◽

Zhaochun Xu ◽

Hui Yang ◽

Hao Lv ◽

Hui Ding ◽

...

Keyword(s):

Saccharomyces Cerevisiae ◽

Ensemble Classifier ◽

Transfer Rna ◽

Test Results ◽

Accurate Identification ◽

Jackknife Test ◽

Heterogeneous Features ◽

Novel Methods

As an abundant post-transcriptional modification, dihydrouridine (D) has been found in transfer RNA (tRNA) from bacteria, eukaryotes, and archaea. Nonetheless, knowledge of the exact biochemical roles of dihydrouridine in mediating tRNA function is still limited. Accurate identification of the position of D sites is essential for understanding their functions. Therefore, it is desirable to develop novel methods to identify D sites. In this study, an ensemble classifier was proposed for the detection of D modification sites in the Saccharomyces cerevisiae transcriptome by using heterogeneous features. The jackknife test results demonstrate that the proposed predictor is promising for the identification of D modification sites. It is anticipated that the proposed method can be widely used for identifying D modification sites in tRNA.
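
The jackknife test referred to above holds out each sample in turn, trains on all the others, and reports the fraction of held-out samples predicted correctly. The sketch below shows that loop. The nearest-neighbour predictor on Hamming distance is a simple stand-in chosen for brevity, not the ensemble classifier of the paper, and the labelled sequences are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def nearest_neighbor_predict(train, query):
    """train: list of (sequence, label). Returns the label of the closest sequence."""
    return min(train, key=lambda item: hamming(item[0], query))[1]

def jackknife_accuracy(samples, predict):
    """Hold out each sample once, train on the rest, and count correct predictions."""
    correct = 0
    for i, (seq, label) in enumerate(samples):
        train = samples[:i] + samples[i + 1:]
        if predict(train, seq) == label:
            correct += 1
    return correct / len(samples)

# Hypothetical labelled windows: 1 = modified site, 0 = unmodified site.
samples = [
    ("AAUAA", 1), ("AAUAC", 1), ("AAUCA", 1),
    ("GGCGG", 0), ("GGCGC", 0), ("GGCCG", 0),
]
print(jackknife_accuracy(samples, nearest_neighbor_predict))
```

Because every sample is tested exactly once and no random split is involved, the jackknife result for a given dataset and predictor is deterministic, at the cost of one training run per sample.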


MpsLDA-ProSVM: predicting multi-label protein subcellular localization by wMLDAe dimensionality reduction and ProSVM classifier

10.1101/2020.04.19.049478 ◽

2020 ◽

Author(s):

Qi Zhang ◽

Shan Li ◽

Bin Yu ◽

Yang Li ◽

Yandan Zhang ◽

...

Keyword(s):

Subcellular Localization ◽

Nearest Neighbor ◽

Chemical Information ◽

Sequence Information ◽

Feature Subset ◽

Protein Subcellular Localization ◽

K Nearest Neighbor ◽

Entropy Weight ◽

Linear Discriminant ◽

Optimal Feature Subset

Abstract: Proteins play a significant part in life processes such as cell growth, development, and reproduction. Exploring protein subcellular localization (SCL) is a direct way to better understand the function of proteins in cells. Studies have found that more and more proteins belong to multiple subcellular locations; these proteins are called multi-label proteins. They not only play a key role in cell life activities but are also indispensable in medicine and drug development. This article presents a new prediction model, MpsLDA-ProSVM, to predict the SCL of multi-label proteins. Firstly, the physical and chemical information, evolution information, sequence information and annotation information of protein sequences are fused. Then, for the first time, a weighted multi-label linear discriminant analysis framework based on entropy weight form (wMLDAe) is used to refine and purify the features and reduce the difficulty of learning. Finally, the optimal feature subset is input into the multi-label learning with label-specific features (LIFT) and multi-label k-nearest neighbor (ML-KNN) algorithms to obtain a synthetic ranking of relevant labels, and a Prediction and Relevance Ordering based SVM (ProSVM) classifier is then used to predict the SCLs. This method can rank and classify related labels at the same time, which greatly improves the efficiency of the model. Tested by the jackknife method, the overall actual accuracy (OAA) on the virus, plant, Gram-positive bacteria and Gram-negative bacteria datasets is 98.06%, 98.97%, 99.81% and 98.49%, which is 0.56%-9.16%, 5.37%-30.87%, 3.51%-6.91% and 3.99%-8.59% higher than other advanced methods, respectively. The source codes and datasets are available at https://github.com/QUST-AIBBDRC/MpsLDA-ProSVM/.
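
For multi-label proteins, the overall actual accuracy (OAA) is a strict metric. Under the definition assumed in this sketch, a protein counts as correct only when its predicted set of locations is exactly equal to its true set. The location names and predictions below are hypothetical.

```python
def overall_actual_accuracy(true_sets, predicted_sets):
    """Fraction of samples whose predicted label set exactly matches the true set."""
    if len(true_sets) != len(predicted_sets):
        raise ValueError("true and predicted lists must have the same length")
    matches = sum(set(t) == set(p) for t, p in zip(true_sets, predicted_sets))
    return matches / len(true_sets)

# Hypothetical example: the second protein has two true locations,
# but only one of them is predicted, so it counts as wrong.
true_sets = [{"nucleus"}, {"cytoplasm", "nucleus"}, {"membrane"}]
predicted_sets = [{"nucleus"}, {"cytoplasm"}, {"membrane"}]
print(overall_actual_accuracy(true_sets, predicted_sets))
```

A partially correct prediction earns no credit under this definition, which makes OAA harder to score well on than per-label accuracy.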
