i6mA-Pred: identifying DNA N6-methyladenine sites in the rice genome

2019 ◽  
Vol 35 (16) ◽  
pp. 2796-2800 ◽  
Author(s):  
Wei Chen ◽  
Hao Lv ◽  
Fulei Nie ◽  
Hao Lin

Abstract. Motivation. DNA N6-methyladenine (6mA) is associated with a wide range of biological processes. Since the distribution of 6mA sites in the genome is non-random, accurate identification of 6mA sites is crucial for understanding their biological functions. Although experimental methods have been proposed for this purpose, they are still not cost-effective for detecting 6mA sites on a genome-wide scale. Therefore, it is desirable to develop computational methods to facilitate the identification of 6mA sites. Results. In this study, a computational method called i6mA-Pred was developed to identify 6mA sites in the rice genome, in which the optimal nucleotide chemical properties obtained by using a feature selection technique were used to encode the DNA sequences. i6mA-Pred yielded an accuracy of 83.13% in the jackknife test. Meanwhile, the performance of i6mA-Pred was also superior to other methods. Availability and implementation. A user-friendly web server, i6mA-Pred, is freely accessible at http://lin-group.cn/server/i6mA-Pred.
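
To illustrate what encoding a DNA sequence by nucleotide chemical properties can look like, here is a minimal sketch. The abstract does not give the property table, so the mapping below is an assumption taken from the way this encoding is commonly defined (ring structure, functional group, hydrogen bonding). The function name `encode_ncp` is hypothetical, and the feature selection step mentioned in the abstract is not shown.

```python
# Nucleotide chemical property (NCP) table, as commonly defined in the
# literature (an assumption here; the abstract does not list it):
#   ring structure   : purine (A, G) = 1, pyrimidine (C, T) = 0
#   functional group : amino (A, C)  = 1, keto (G, T)       = 0
#   hydrogen bonding : weak (A, T)   = 1, strong (C, G)     = 0
NCP = {
    "A": (1, 1, 1),
    "C": (0, 1, 0),
    "G": (1, 0, 0),
    "T": (0, 0, 1),
}


def encode_ncp(sequence):
    """Return a flat list holding three binary values per nucleotide."""
    features = []
    for base in sequence.upper():
        if base not in NCP:
            raise ValueError(f"unsupported nucleotide: {base!r}")
        features.extend(NCP[base])
    return features
```

Calling `encode_ncp("GAT")` yields a nine-element vector, three values for each of the three nucleotides; a window of fixed length around a candidate adenine would therefore always produce a vector of the same size.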

Genes ◽  
2019 ◽  
Vol 10 (10) ◽  
pp. 828 ◽  
Author(s):  
Liang Kong ◽  
Lichao Zhang

DNA N6-methyladenine (6mA) plays an important role in regulating the gene expression of eukaryotes. Accurate identification of 6mA sites may assist in understanding genomic 6mA distributions and biological functions. Various experimental methods have been applied to detect 6mA sites on a genome-wide scale, but they are too time-consuming and expensive. Computational methods that rapidly identify 6mA sites are therefore needed. In this paper, a new machine learning-based method, i6mA-DNCP, was proposed for identifying 6mA sites in the rice genome. Dinucleotide composition and dinucleotide-based DNA properties were first employed to represent DNA sequences. After a specially designed DNA property selection process, a bagging classifier was used to build the prediction model. The jackknife test on a benchmark dataset demonstrated that i6mA-DNCP could obtain 84.43% sensitivity, 88.86% specificity, 86.65% accuracy, a Matthews correlation coefficient (MCC) of 0.734, and an area under the receiver operating characteristic curve (AUC) of 0.926. Moreover, three independent datasets were established to assess the generalization ability of the method. Extensive experiments validated the effectiveness of i6mA-DNCP.


Author(s):  
Yanrong Ji ◽  
Zhihan Zhou ◽  
Han Liu ◽  
Ramana V Davuluri

Abstract. Motivation. Deciphering the language of non-coding DNA is one of the fundamental problems in genome research. The gene regulatory code is highly complex due to the existence of polysemy and distant semantic relationships, which previous informatics methods often fail to capture, especially in data-scarce scenarios. Results. To address this challenge, we developed a novel pre-trained bidirectional encoder representation, named DNABERT, to capture a global and transferable understanding of genomic DNA sequences based on upstream and downstream nucleotide contexts. We compared DNABERT to the most widely used programs for genome-wide regulatory element prediction and demonstrated its ease of use, accuracy and efficiency. We show that a single pre-trained transformer model can simultaneously achieve state-of-the-art performance on prediction of promoters, splice sites and transcription factor binding sites, after easy fine-tuning using small task-specific labeled data. Further, DNABERT enables direct visualization of nucleotide-level importance and semantic relationships within input sequences for better interpretability and accurate identification of conserved sequence motifs and functional genetic variant candidates. Finally, we demonstrate that DNABERT pre-trained on the human genome can even be readily applied to other organisms with exceptional performance. We anticipate that the pre-trained DNABERT model can be fine-tuned for many other sequence analysis tasks. Availability and implementation. The source code and the pre-trained and fine-tuned models for DNABERT are available at GitHub (https://github.com/jerryji1993/DNABERT). Supplementary information. Supplementary data are available at Bioinformatics online.


Genes ◽  
2021 ◽  
Vol 12 (3) ◽  
pp. 354
Author(s):  
Lu Zhang ◽  
Xinyi Qin ◽  
Min Liu ◽  
Ziwei Xu ◽  
Guangzhong Liu

As a prevalent post-transcriptional modification of RNA, N6-methyladenosine (m6A) plays a crucial role in various biological processes. To better reveal its regulatory mechanism and provide new insights for drug design, accurate genome-wide identification of m6A sites is vital. As the traditional experimental methods are time-consuming and cost-prohibitive, it is necessary to design a more efficient computational method to detect m6A sites. In this study, we propose a novel cross-species computational method, DNN-m6A, based on a deep neural network (DNN) to identify m6A sites in multiple tissues of human, mouse and rat. Firstly, binary encoding (BE), tri-nucleotide composition (TNC), enhanced nucleic acid composition (ENAC), K-spaced nucleotide pair frequencies (KSNPFs), nucleotide chemical property (NCP), pseudo dinucleotide composition (PseDNC), position-specific nucleotide propensity (PSNP) and position-specific dinucleotide propensity (PSDP) are employed to extract RNA sequence features, which are subsequently fused to construct the initial feature vector set. Secondly, we use elastic net to eliminate redundant features while building the optimal feature subset. Finally, the hyper-parameters of the DNN are tuned with Bayesian hyper-parameter optimization based on the selected feature subset. The five-fold cross-validation test on the training datasets shows that the proposed DNN-m6A method outperformed the state-of-the-art method for predicting m6A sites, with an accuracy (ACC) of 73.58%–83.38% and an area under the curve (AUC) of 81.39%–91.04%. Furthermore, on the independent datasets the method achieved an ACC of 72.95%–83.04% and an AUC of 80.79%–91.09%, which shows the excellent generalization ability of the proposed method.
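
As an illustration of one of the descriptors listed above, the following sketch computes K-spaced nucleotide pair frequencies for a single gap size. The abstract does not define the descriptor, so the normalization used here (pair count divided by the number of pairs at that gap) is an assumption based on how the feature is usually described; the names `PAIRS` and `kspaced_pair_frequencies` are hypothetical.

```python
from itertools import product

ALPHABET = "ACGU"  # RNA alphabet
PAIRS = ["".join(p) for p in product(ALPHABET, repeat=2)]  # 16 ordered pairs


def kspaced_pair_frequencies(sequence, k):
    """Frequencies of nucleotide pairs separated by exactly k positions.

    k = 0 gives the ordinary dinucleotide composition.
    """
    seq = sequence.upper()
    n_pairs = len(seq) - k - 1
    if k < 0 or n_pairs <= 0:
        raise ValueError("sequence is too short for this gap size")
    counts = dict.fromkeys(PAIRS, 0)
    for i in range(n_pairs):
        pair = seq[i] + seq[i + k + 1]
        if pair in counts:  # pairs with ambiguous bases are skipped
            counts[pair] += 1
    return [counts[p] / n_pairs for p in PAIRS]
```

Concatenating the 16-element vectors for several gap sizes, together with the other descriptors, would give the kind of fused feature set that the selection step then prunes.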


2009 ◽  
Vol 2009 ◽  
pp. 1-8 ◽  
Author(s):  
Winfried A. Hofmann ◽  
Anja Weigmann ◽  
Marcel Tauscher ◽  
Britta Skawran ◽  
Tim Focken ◽  
...  

Background. Array-based comparative genomic hybridization (array-CGH) is an emerging high-resolution and high-throughput molecular genetic technique that allows genome-wide screening for chromosome alterations. DNA copy number alterations (CNAs) are a hallmark of somatic mutations in tumor genomes and congenital abnormalities that lead to diseases such as mental retardation. However, accurate identification of amplified or deleted regions requires a sequence of different computational analysis steps of the microarray data. Results. We have developed a user-friendly and versatile tool for the normalization, visualization, breakpoint detection, and comparative analysis of array-CGH data which allows the accurate and sensitive detection of CNAs. Conclusion. The implemented option for the determination of minimal altered regions (MARs) from a series of tumor samples is a step forward in the identification of new tumor suppressor genes or oncogenes.


2019 ◽  
Author(s):  
Qiong Zhang

Transcription factors (TFs) are key regulators that play crucial roles in biological processes. The identification of TF-target regulatory relationships is a key step for revealing the functions of TFs and their regulation of gene expression. The accumulated data from chromatin immunoprecipitation sequencing (ChIP-Seq) provide great opportunities to discover TF-target regulations across different conditions. In this study, we constructed a database named hTFtarget, which integrates large-scale human TF target resources (7,190 ChIP-Seq samples of 659 TFs and high-confidence TF binding sites of 699 TFs) and epigenetic modification information to predict accurate TF-target regulations. hTFtarget offers the following functions for users to explore TF-target regulations: 1) Browse or search general targets of a query TF across datasets; 2) Browse TF-target regulations for a query TF in a specific dataset or tissue; 3) Search potential TFs for a given target gene or ncRNA; 4) Investigate co-association between TFs in cell lines; 5) Explore potential co-regulations for given target genes or TFs; 6) Predict candidate TFBSs on given DNA sequences; 7) View ChIP-Seq peaks for different TFs and conditions in a genome browser. hTFtarget provides a comprehensive, reliable and user-friendly resource for exploring human TF-target regulations, which will be very useful for a wide range of users in the TF and gene expression regulation community. hTFtarget is available at http://bioinfo.life.hust.edu.cn/hTFtarget.


2020 ◽  
Author(s):  
Rui Gan ◽  
Fengxia Zhou ◽  
Yu Si ◽  
Han Yang ◽  
Chuangeng Chen ◽  
...  

Abstract. Summary. As an intracellular form of a bacteriophage in the bacterial host genome, a prophage is usually integrated into bacterial DNA with high specificity and contributes to horizontal gene transfer (HGT). Phage therapy has been widely applied, for example, using phages to kill bacteria to treat pathogenic and resistant bacterial infections. Therefore, it is necessary to develop effective tools for the fast and accurate identification of prophages. Here, we introduce DBSCAN-SWA, a command line software tool developed to predict prophage regions of bacterial genomes. DBSCAN-SWA runs faster than any previous tool. Importantly, it has great detection power based on analysis using 184 manually curated prophages, with a recall of 85% compared with Phage_Finder (63%), VirSorter (74%) and PHASTER (82%) for raw DNA sequences. DBSCAN-SWA also provides user-friendly visualizations including a circular prophage viewer and interactive DataTables. Availability and implementation. DBSCAN-SWA is implemented in Python3 and is freely available under an open source GPLv2 license from https://github.com/HIT-ImmunologyLab/DBSCAN-SWA/.


Author(s):  
Tianshun Gao ◽  
Jiang Qian

Abstract. Enhancers are distal cis-regulatory elements that activate the transcription of their target genes. They regulate a wide range of important biological functions and processes, including embryogenesis, development, and homeostasis. As more and more large-scale technologies have been developed for enhancer identification, a comprehensive database is highly desirable for enhancer annotation based on various genome-wide profiling datasets across different species. Here, we present an updated database, EnhancerAtlas 2.0 (http://www.enhanceratlas.org/indexv2.php), covering 586 tissue/cell types that include a large number of normal tissues, cancer cell lines, and cells at different development stages across nine species. Overall, the database contains 13 494 603 enhancers, which were obtained from 16 055 datasets using 12 high-throughput experiment methods (e.g. H3K4me1/H3K27ac, DNase-seq/ATAC-seq, P300, POLR2A, CAGE, ChIA-PET, GRO-seq, STARR-seq and MPRA). The updated version is a huge expansion of the first version, which only contained enhancers in human cells. In addition, we predicted enhancer–target gene relationships in human, mouse and fly. Finally, users can search enhancers and enhancer–target gene relationships through five user-friendly, interactive modules. We believe the new annotation of enhancers in EnhancerAtlas 2.0 will help users perform functional analyses of enhancers in various genomes.


2020 ◽  
Author(s):  
Zhaomin Yu ◽  
Baoguang Tian ◽  
Yaning Liu ◽  
Yaqun Zhang ◽  
Qin Ma ◽  
...  

Abstract. N6-methyladenosine (m6A) is a prevalent RNA methylation modification, which plays an important role in various biological processes. Accurate identification of m6A sites is fundamental to a deep understanding of the biological functions and mechanisms of the modification. However, the experimental methods for detecting m6A sites are usually time-consuming and expensive, and various computational methods have been developed to identify m6A sites in RNA. This paper proposes a novel cross-species computational method, StackRAM, which uses machine learning algorithms to identify m6A sites in S. cerevisiae, H. sapiens and A. thaliana. First, RNA sequence features are extracted through binary encoding, chemical property, nucleotide frequency, k-mer nucleotide frequency, pseudo dinucleotide composition, and position-specific trinucleotide propensity, and the initial feature set is obtained by feature fusion. Secondly, the Elastic Net is used for the first time to filter redundant and noisy information and retain important features for m6A site classification. Finally, the output probabilities of the base classifiers are combined with the optimal feature subset selected by the Elastic Net, and the combined features are fed into the second-stage meta-classifier, an SVM. The jackknife test on the S. cerevisiae training dataset indicates that the prediction performance of StackRAM is superior to the current state-of-the-art methods. StackRAM prediction accuracy on the independent test datasets for H. sapiens and A. thaliana reaches 92.30% and 87.06%, respectively. Therefore, StackRAM has development potential in cross-species prediction and can be a useful method for identifying m6A sites. The source code and all datasets are available at https://github.com/QUST-AIBBDRC/StackRAM/.
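
The final stacking step described above, in which base-classifier probabilities are joined with the selected features before reaching the meta-classifier, can be sketched as a simple per-sample concatenation. This is only an illustration of that data flow, not the authors' implementation; the function name `build_meta_features` and the toy numbers are hypothetical, and training the base classifiers and the SVM is left out.

```python
def build_meta_features(selected_features, base_probabilities):
    """Concatenate selected features with base-classifier probabilities.

    selected_features  : one feature list per sample
    base_probabilities : one list per base classifier, each holding one
                         predicted probability per sample
    """
    n_samples = len(selected_features)
    for probs in base_probabilities:
        if len(probs) != n_samples:
            raise ValueError("each base classifier needs one output per sample")
    meta = []
    for i, features in enumerate(selected_features):
        # Original (selected) features first, then one column per base model.
        meta.append(list(features) + [probs[i] for probs in base_probabilities])
    return meta


# Toy example: two samples, two selected features, two base classifiers.
meta_input = build_meta_features(
    [[0.1, 0.2], [0.3, 0.4]],
    [[0.9, 0.2], [0.8, 0.1]],
)
```

In stacking generally, the base probabilities used for training the meta-classifier are usually out-of-fold predictions, so that the second stage does not see outputs from models that were fitted on the same samples.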


2020 ◽  
Vol 36 (11) ◽  
pp. 3336-3342 ◽  
Author(s):  
Kewei Liu ◽  
Wei Chen

Abstract. Motivation. RNA modifications play critical roles in a series of cellular and developmental processes. Knowledge about the distributions of RNA modifications in the transcriptomes will provide clues to revealing their functions. Since experimental methods for detecting RNA modifications are time-consuming and laborious, computational methods have been proposed for this aim in the past five years. However, both experimental and computational methods have drawbacks in simultaneously identifying modifications occurring on different nucleotides. Results. To address this challenge, we developed a new predictor called iMRM, which is able to simultaneously identify m6A, m5C, m1A, ψ and A-to-I modifications in Homo sapiens, Mus musculus and Saccharomyces cerevisiae. In iMRM, a feature selection technique was used to pick out the optimal features. The results from both 10-fold cross-validation and the jackknife test demonstrated that the performance of iMRM is superior to existing methods for identifying RNA modifications. Availability and implementation. A user-friendly web server for iMRM was established at http://www.bioml.cn/XG_iRNA/home. The off-line command-line version is available at https://github.com/liukeweiaway/iMRM. Contact. [email protected] Supplementary information. Supplementary data are available at Bioinformatics online.
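
The two evaluation protocols named above differ only in how many samples are held out at a time: 10-fold cross-validation holds out about a tenth of the data per round, while the jackknife test holds out a single sample (leave-one-out). The sketch below generates the index splits for both; the function name `k_fold_splits` and the seeded shuffle are assumptions made for illustration, not details from the paper.

```python
import random


def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Setting k equal to n_samples gives the jackknife (leave-one-out) test.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # reproducible shuffle
    folds = [indices[i::k] for i in range(k)]  # near-equal fold sizes
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        yield train, sorted(held_out)


ten_fold = list(k_fold_splits(n_samples=25, k=10))
jackknife = list(k_fold_splits(n_samples=25, k=25))
```

Each sample appears in exactly one test fold, so every prediction used for scoring comes from a model that never saw that sample during training.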


Molecules ◽  
2021 ◽  
Vol 26 (9) ◽  
pp. 2487
Author(s):  
Haitao Han ◽  
Chenchen Ding ◽  
Xin Cheng ◽  
Xiuzhi Sang ◽  
Taigang Liu

Many gram-negative bacteria use type IV secretion systems to deliver effector molecules to a wide range of target cells. These substrate proteins, which are called type IV secreted effectors (T4SEs), manipulate host cell processes during infection, often resulting in severe diseases or even death of the host. Therefore, identification of putative T4SEs has become a very active research topic in bioinformatics due to its vital role in understanding host-pathogen interactions. PSI-BLAST profiles have been experimentally validated to provide important and discriminatory evolutionary information for various protein classification tasks. In the present study, an accurate computational predictor termed iT4SE-EP was developed for identifying T4SEs by extracting evolutionary features from the position-specific scoring matrix and the position-specific frequency matrix profiles. First, four types of encoding strategies were designed to transform protein sequences into fixed-length feature vectors based on the two profiles. Then, a feature selection technique based on the random forest algorithm was utilized to reduce redundant or irrelevant features without much loss of information. Finally, the optimal features were input into a support vector machine classifier to carry out the prediction of T4SEs. Our experimental results demonstrated that iT4SE-EP outperformed most existing methods on the independent dataset test.
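
A profile such as a position-specific scoring matrix has one row per residue, so its size varies with protein length, whereas a classifier needs inputs of a fixed size. The abstract does not describe the four encoding strategies, so the sketch below shows only one simple, generic way to obtain a fixed-length vector: averaging each column over all rows. It is an assumption for illustration and is not claimed to be one of the paper's strategies; the function name `column_mean_features` is hypothetical.

```python
def column_mean_features(profile):
    """Average each column of an L x W profile into a length-W vector.

    For a real PSI-BLAST profile W would be 20 (one column per amino acid);
    the result has the same length whatever the protein length L is.
    """
    if not profile:
        raise ValueError("profile must contain at least one row")
    width = len(profile[0])
    if any(len(row) != width for row in profile):
        raise ValueError("all rows must have the same number of columns")
    n_rows = len(profile)
    return [sum(row[j] for row in profile) / n_rows for j in range(width)]


# Toy 2 x 3 profile standing in for an L x 20 matrix.
toy_vector = column_mean_features([[1, 2, 3], [3, 4, 5]])
```

Proteins of different lengths then map to vectors of the same dimension, which is the property that makes a subsequent feature selection and SVM step possible.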

