scholarly journals Accurate prediction of genome-wide RNA secondary structure profile based on extreme gradient boosting

2020 ◽  
Vol 36 (17) ◽  
pp. 4576-4582
Author(s):  
Yaobin Ke ◽  
Jiahua Rao ◽  
Huiying Zhao ◽  
Yutong Lu ◽  
Nong Xiao ◽  
...  

Abstract Motivation RNA secondary structure plays a vital role in fundamental cellular processes, and identification of RNA secondary structure is a key step to understand RNA functions. Recently, a few experimental methods were developed to profile genome-wide RNA secondary structure, i.e. the pairing probability of each nucleotide, through high-throughput sequencing techniques. However, these high-throughput methods have low precision and cannot cover all nucleotides due to limited sequencing coverage. Results Here, we have developed a new method for the prediction of genome-wide RNA secondary structure profile from RNA sequence based on the extreme gradient boosting technique. The method achieves predictions with areas under the receiver operating characteristic curve (AUC) >0.9 on three different datasets, and AUC of 0.888 by another independent test on the recently released Zika virus data. These AUCs are consistently >5% greater than those by the CROSS method recently developed based on a shallow neural network. Further analysis on the 1000 Genome Project data showed that our predicted unpaired probabilities are highly correlated (>0.8) with the minor allele frequencies at synonymous, non-synonymous mutations, and mutations in untranslated regions, which were higher than those generated by RNAplfold. Moreover, the prediction over all human mRNA indicated a consistent result with previous observation that there is a periodic distribution of unpaired probability on codons. The accurate predictions by our method indicate that such model trained on genome-wide experimental data might be an alternative for analytical methods. Availability and implementation The GRASP is available for academic use at https://github.com/sysu-yanglab/GRASP. Supplementary information Supplementary data are available online.

2019 ◽  
Author(s):  
Yaobin Ke ◽  
Jiahua Rao ◽  
Huiying Zhao ◽  
Yutong Lu ◽  
Nong Xiao ◽  
...  

AbstractMotivationMany studies have shown that RNA secondary structure plays a vital role in fundamental cellular processes, such as protein synthesis, mRNA processing, mRNA assembly, ribosome function and eukaryotic spliceosomes. Identification of RNA secondary structure is a key step to understand the common mechanisms underlying the translation process. Recently, a few experimental methods were developed to measure genome-wide RNA secondary structure profile through high-throughput sequencing techniques, and have been successfully applied to genomes including yeast and human. However, these high-throughput methods usually have low precision and are hard to cover all nucleotides on the RNA due to limited sequencing coverage.ResultsIn this study, we developed a new method for the prediction of genome-wide RNA secondary structure profile (TH-GRASP) from RNA sequence based on eXtreme Gradient Boosting (XGBoost). The method achieves an prediction with areas under the receiver operating characteristic curve (AUC) values greater than 0.9 on three different datasets, and AUC of 0.892 by an independent test on the recently released Zika virus RNA dataset. These AUCs represent a consistent increase of >6% than the recently developed method CROSS trained by a shallow neural network. A further analysis on the 1000-Genome Project data showed that our predicted unpaired probability at mutations sites are highly correlated with the minor allele frequencies (MAF) of synonymous, non-synonymous mutations, and mutations in 3’ and 5’UTR with Pearson Correlation Coefficients all above 0.8. These PCCs are consistently higher than those generated by RNAplfold method. Moreover, an investigation over all human mRNA indicated a periodic distribution of the predicted unpaired probability on codons, and a decrease of paired probability in the boundary with 5’ and 3’ untranslated regions. These results highlighted TH-GRASP is effective to remove experimental noises and to have ability to make predictions on nucleotides with low or no coverage by fitting high-throughput genomic data for RNA secondary structure profiles, and also suggested that building model on high throughput experimental data might be a future direction to substitute analytical methods.AvailabilityThe TH-GRASP is available for academic use athttps://github.com/sysu-yanglab/TH-GRASP.Supplementary informationSupplementary data are available online.


2016 ◽  
Vol 17 (1) ◽  
Author(s):  
Nathan D. Berkowitz ◽  
Ian M. Silverman ◽  
Daniel M. Childress ◽  
Hilal Kazan ◽  
Li-San Wang ◽  
...  

2020 ◽  
Vol 3 (1) ◽  
Author(s):  
Juan Xie ◽  
Jinfang Zheng ◽  
Xu Hong ◽  
Xiaoxue Tong ◽  
Shiyong Liu

AbstractProtein-RNA interaction participates in many biological processes. So, studying protein–RNA interaction can help us to understand the function of protein and RNA. Although the protein–RNA 3D3D model, like PRIME, was useful in building 3D structural complexes, it can’t be used genome-wide, due to lacking RNA 3D structures. To take full advantage of RNA secondary structures revealed from high-throughput sequencing, we present PRIME-3D2D to predict binding sites of protein–RNA interaction. PRIME-3D2D is almost as good as PRIME at modeling protein–RNA complexes. PRIME-3D2D can be used to predict binding sites on PDB data (MCC = 0.75/0.70 for binding sites in protein/RNA) and transcription-wide (MCC = 0.285 for binding sites in RNA). Testing on PDB and yeast transcription-wide data show that PRIME-3D2D performs better than other binding sites predictor. So, PRIME-3D2D can be used to predict the binding sites both on PDB and genome-wide, and it’s freely available.


2020 ◽  
Vol 36 (12) ◽  
pp. 3632-3636 ◽  
Author(s):  
Weibo Zheng ◽  
Jing Chen ◽  
Thomas G Doak ◽  
Weibo Song ◽  
Ying Yan

Abstract Motivation Programmed DNA elimination (PDE) plays a crucial role in the transitions between germline and somatic genomes in diverse organisms ranging from unicellular ciliates to multicellular nematodes. However, software specific for the detection of DNA splicing events is scarce. In this paper, we describe Accurate Deletion Finder (ADFinder), an efficient detector of PDEs using high-throughput sequencing data. ADFinder can predict PDEs with relatively low sequencing coverage, detect multiple alternative splicing forms in the same genomic location and calculate the frequency for each splicing event. This software will facilitate research of PDEs and all down-stream analyses. Results By analyzing genome-wide DNA splicing events in two micronuclear genomes of Oxytricha trifallax and Tetrahymena thermophila, we prove that ADFinder is effective in predicting large scale PDEs. Availability and implementation The source codes and manual of ADFinder are available in our GitHub website: https://github.com/weibozheng/ADFinder. Supplementary information Supplementary data are available at Bioinformatics online.


Author(s):  
Yuansheng Liu ◽  
Xiaocai Zhang ◽  
Quan Zou ◽  
Xiangxiang Zeng

Abstract Summary Removing duplicate and near-duplicate reads, generated by high-throughput sequencing technologies, is able to reduce computational resources in downstream applications. Here we develop minirmd, a de novo tool to remove duplicate reads via multiple rounds of clustering using different length of minimizer. Experiments demonstrate that minirmd removes more near-duplicate reads than existing clustering approaches and is faster than existing multi-core tools. To the best of our knowledge, minirmd is the first tool to remove near-duplicates on reverse-complementary strand. Availability and implementation https://github.com/yuansliu/minirmd. Supplementary information Supplementary data are available at Bioinformatics online.


Sign in / Sign up

Export Citation Format

Share Document