i6mA-Pred: identifying DNA N6-methyladenine sites in the rice genome

Wei Chen; Hao Lv; Fulei Nie; Hao Lin

doi:10.1093/bioinformatics/btz015

Bioinformatics ◽

10.1093/bioinformatics/btz015 ◽

2019 ◽

Vol 35 (16) ◽

pp. 2796-2800 ◽

Cited By ~ 98

Author(s):

Wei Chen ◽

Hao Lv ◽

Fulei Nie ◽

Hao Lin

Keyword(s):

DNA Sequences ◽

Rice Genome ◽

Chemical Properties ◽

Computational Method ◽

Accurate Identification ◽

Feature Selection Technique ◽

Jackknife Test ◽

Genome Wide ◽

Wide Range ◽

User Friendly

Abstract. Motivation: DNA N6-methyladenine (6mA) is associated with a wide range of biological processes. Since the distribution of 6mA sites in the genome is non-random, accurate identification of 6mA sites is crucial for understanding their biological functions. Although experimental methods have been proposed for this purpose, they remain too costly for detecting 6mA sites genome-wide. Therefore, it is desirable to develop computational methods to facilitate the identification of 6mA sites. Results: In this study, a computational method called i6mA-Pred was developed to identify 6mA sites in the rice genome, in which the optimal nucleotide chemical properties, obtained using a feature selection technique, were used to encode the DNA sequences. i6mA-Pred yielded an accuracy of 83.13% in the jackknife test. Meanwhile, the performance of i6mA-Pred was also superior to other methods. Availability and implementation: A user-friendly web server, i6mA-Pred, is freely accessible at http://lin-group.cn/server/i6mA-Pred.
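
The abstract above says sequences are encoded with nucleotide chemical properties but does not spell out the mapping. As a minimal sketch, the block below assumes the three-bit scheme commonly used in this literature (ring structure, functional group, hydrogen bonding); the table values, the function name `encode_ncp` and the all-zero handling of unknown bases are assumptions for illustration, not details taken from the paper, and the feature selection step is not shown.

```python
# Commonly used nucleotide chemical property (NCP) scheme (assumed here,
# not quoted from the abstract). Each tuple is
# (ring structure, functional group, hydrogen bonding):
#   ring structure:   purine (A, G) = 1, pyrimidine (C, T) = 0
#   functional group: amino (A, C)  = 1, keto (G, T)       = 0
#   hydrogen bonding: weak (A, T)   = 1, strong (C, G)     = 0
NCP = {
    "A": (1, 1, 1),
    "C": (0, 1, 0),
    "G": (1, 0, 0),
    "T": (0, 0, 1),
}


def encode_ncp(seq):
    """Return a flat list of 3 * len(seq) binary features."""
    features = []
    for base in seq.upper():
        # Unknown symbols such as N map to all zeros (an assumption).
        features.extend(NCP.get(base, (0, 0, 0)))
    return features


print(encode_ncp("GATC"))  # [1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0]
```

Each position contributes three features, so a sequence of length n yields a vector of length 3n from which a feature selection step could then pick a subset.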

i6mA-DNCP: Computational Identification of DNA N6-Methyladenine Sites in the Rice Genome Using Optimized Dinucleotide-Based Features

Genes ◽

10.3390/genes10100828 ◽

2019 ◽

Vol 10 (10) ◽

pp. 828 ◽

Cited By ~ 11

Author(s):

Liang Kong ◽

Lichao Zhang

Keyword(s):

DNA Sequences ◽

Selection Process ◽

Characteristic Curve ◽

Rice Genome ◽

Accurate Identification ◽

Jackknife Test ◽

Genome Wide ◽

Dinucleotide Composition ◽

A Genome ◽

Matthews Correlation Coefficient

DNA N6-methyladenine (6mA) plays an important role in regulating gene expression in eukaryotes. Accurate identification of 6mA sites may assist in understanding genomic 6mA distributions and biological functions. Various experimental methods have been applied to detect 6mA sites genome-wide, but they are time-consuming and expensive. Computational methods that rapidly identify 6mA sites are therefore needed. In this paper, a new machine learning-based method, i6mA-DNCP, was proposed for identifying 6mA sites in the rice genome. Dinucleotide composition and dinucleotide-based DNA properties were first employed to represent DNA sequences. After a specially designed DNA property selection process, a bagging classifier was used to build the prediction model. The jackknife test on a benchmark dataset demonstrated that i6mA-DNCP could obtain 84.43% sensitivity, 88.86% specificity, 86.65% accuracy, a Matthews correlation coefficient (MCC) of 0.734, and an area under the receiver operating characteristic curve (AUC) of 0.926. Moreover, three independent datasets were established to assess the generalization ability of the method. Extensive experiments validated the effectiveness of i6mA-DNCP.

DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome

Bioinformatics ◽

10.1093/bioinformatics/btab083 ◽

2021 ◽

Author(s):

Yanrong Ji ◽

Zhihan Zhou ◽

Han Liu ◽

Ramana V Davuluri

Keyword(s):

DNA Sequences ◽

Regulatory Elements ◽

Ease Of Use ◽

Fine Tuning ◽

Supplementary Information ◽

Sequence Motifs ◽

Semantic Relationship ◽

Accurate Identification ◽

Conserved Sequence ◽

Genome Wide

Abstract. Motivation: Deciphering the language of non-coding DNA is one of the fundamental problems in genome research. The gene regulatory code is highly complex due to the existence of polysemy and distant semantic relationships, which previous informatics methods often fail to capture, especially in data-scarce scenarios. Results: To address this challenge, we developed a novel pre-trained bidirectional encoder representation, named DNABERT, to capture a global and transferable understanding of genomic DNA sequences based on upstream and downstream nucleotide contexts. We compared DNABERT to the most widely used programs for genome-wide regulatory element prediction and demonstrate its ease of use, accuracy and efficiency. We show that a single pre-trained transformer model can simultaneously achieve state-of-the-art performance on prediction of promoters, splice sites and transcription factor binding sites, after easy fine-tuning using small task-specific labeled data. Further, DNABERT enables direct visualization of nucleotide-level importance and semantic relationships within input sequences for better interpretability and accurate identification of conserved sequence motifs and functional genetic variant candidates. Finally, we demonstrate that DNABERT pre-trained on the human genome can be readily applied to other organisms with exceptional performance. We anticipate that the pre-trained DNABERT model can be fine-tuned for many other sequence analysis tasks. Availability and implementation: The source code and the pre-trained and fine-tuned models for DNABERT are available on GitHub (https://github.com/jerryji1993/DNABERT). Supplementary information: Supplementary data are available at Bioinformatics online.
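
Before a transformer can read DNA, the sequence has to be split into tokens. The abstract does not describe this step; DNABERT is generally described as using overlapping k-mers, and the sketch below illustrates that idea only. The function name `kmer_tokens` and the default `k=6` are assumptions for illustration and are not the repository's API.

```python
def kmer_tokens(seq, k=6):
    """Split a DNA sequence into overlapping k-mers with stride 1."""
    seq = seq.upper()
    # A sequence shorter than k yields an empty token list.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


print(kmer_tokens("ATGGCT", k=3))  # ['ATG', 'TGG', 'GGC', 'GCT']
```

A sequence of length n produces n - k + 1 tokens, so each nucleotide appears in up to k neighbouring tokens, which is one way a model can see local upstream and downstream context.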

DNN-m6A: A Cross-Species Method for Identifying RNA N6-Methyladenosine Sites Based on Deep Neural Network with Multi-Information Fusion

Genes ◽

10.3390/genes12030354 ◽

2021 ◽

Vol 12 (3) ◽

pp. 354

Author(s):

Lu Zhang ◽

Xinyi Qin ◽

Min Liu ◽

Ziwei Xu ◽

Guangzhong Liu

Keyword(s):

Neural Network ◽

Deep Neural Network ◽

Area Under The Curve ◽

Nucleotide Composition ◽

Computational Method ◽

Feature Subset ◽

Accurate Identification ◽

Genome Wide ◽

Dinucleotide Composition ◽

Optimal Feature Subset

As a prevalent post-transcriptional modification of RNA, N6-methyladenosine (m6A) plays a crucial role in various biological processes. To better reveal its regulatory mechanism and provide new insights for drug design, accurate genome-wide identification of m6A sites is vital. As traditional experimental methods are time-consuming and cost-prohibitive, it is necessary to design a more efficient computational method to detect m6A sites. In this study, we propose a novel cross-species computational method, DNN-m6A, based on a deep neural network (DNN), to identify m6A sites in multiple tissues of human, mouse and rat. Firstly, binary encoding (BE), tri-nucleotide composition (TNC), enhanced nucleic acid composition (ENAC), K-spaced nucleotide pair frequencies (KSNPFs), nucleotide chemical property (NCP), pseudo dinucleotide composition (PseDNC), position-specific nucleotide propensity (PSNP) and position-specific dinucleotide propensity (PSDP) are employed to extract RNA sequence features, which are subsequently fused to construct the initial feature vector set. Secondly, we use elastic net to eliminate redundant features while building the optimal feature subset. Finally, the hyper-parameters of the DNN are tuned with Bayesian hyper-parameter optimization based on the selected feature subset. The five-fold cross-validation test on the training datasets shows that the proposed DNN-m6A method outperformed the state-of-the-art method for predicting m6A sites, with an accuracy (ACC) of 73.58%–83.38% and an area under the curve (AUC) of 81.39%–91.04%. Furthermore, on the independent datasets the method achieved an ACC of 72.95%–83.04% and an AUC of 80.79%–91.09%, which shows the excellent generalization ability of the proposed method.
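
Of the encodings listed above, K-spaced nucleotide pair frequencies are easy to illustrate. The sketch below assumes the usual definition: for a gap k, count every pair of nucleotides separated by exactly k positions and normalise by the number of such pairs. The function name `ksnpf`, the RNA alphabet order and the skipping of ambiguous symbols are assumptions for illustration, not details from the paper.

```python
from itertools import product


def ksnpf(seq, k, alphabet="ACGU"):
    """K-spaced nucleotide pair frequencies for a single gap size k.

    Returns 16 values, one per ordered pair (AA, AC, ..., UU), where the
    two nucleotides of a pair are separated by k intervening positions.
    """
    seq = seq.upper()
    pairs = ["".join(p) for p in product(alphabet, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    total = len(seq) - k - 1  # number of k-spaced pairs in the sequence
    for i in range(total):
        pair = seq[i] + seq[i + k + 1]
        if pair in counts:  # pairs with ambiguous symbols are skipped
            counts[pair] += 1
    return [counts[p] / total if total > 0 else 0.0 for p in pairs]


vector = ksnpf("ACGUACGU", k=1)
print(len(vector), round(sum(vector), 6))  # 16 1.0
```

Concatenating the vectors for several gap sizes (for example k = 0, 1, 2) gives one of the feature groups that a pipeline like the one described could fuse with the other encodings.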

Analysis of Array-CGH Data Using the R and Bioconductor Software Suite

Comparative and Functional Genomics ◽

10.1155/2009/201325 ◽

2009 ◽

Vol 2009 ◽

pp. 1-8 ◽

Cited By ~ 7

Author(s):

Winfried A. Hofmann ◽

Anja Weigmann ◽

Marcel Tauscher ◽

Britta Skawran ◽

Tim Focken ◽

...

Keyword(s):

Array CGH ◽

Molecular Genetic ◽

Comparative Genomic ◽

Accurate Identification ◽

Genome Wide ◽

Versatile Tool ◽

Breakpoint Detection ◽

DNA Copy Number Alterations ◽

User Friendly

Background. Array-based comparative genomic hybridization (array-CGH) is an emerging high-resolution and high-throughput molecular genetic technique that allows genome-wide screening for chromosome alterations. DNA copy number alterations (CNAs) are a hallmark of somatic mutations in tumor genomes and congenital abnormalities that lead to diseases such as mental retardation. However, accurate identification of amplified or deleted regions requires a sequence of different computational analysis steps of the microarray data. Results. We have developed a user-friendly and versatile tool for the normalization, visualization, breakpoint detection, and comparative analysis of array-CGH data which allows the accurate and sensitive detection of CNAs. Conclusion. The implemented option for the determination of minimal altered regions (MARs) from a series of tumor samples is a step forward in the identification of new tumor suppressor genes or oncogenes.

hTFtarget: a comprehensive database for regulations of human transcription factors and their targets

10.1101/843656 ◽

2019 ◽

Cited By ~ 1

Author(s):

Qiong Zhang

Keyword(s):

Gene Expression ◽

Transcription Factors ◽

DNA Sequences ◽

Chromatin Immunoprecipitation ◽

Target Genes ◽

Gene Expression Regulation ◽

Epigenetic Modification ◽

Wide Range ◽

Chromatin Immunoprecipitation Sequencing ◽

User Friendly

Transcription factors (TFs) are key regulators that play crucial roles in biological processes. The identification of TF-target regulatory relationships is a key step in revealing the functions of TFs and their regulation of gene expression. The accumulated chromatin immunoprecipitation sequencing (ChIP-Seq) data provide great opportunities to discover TF-target regulations across different conditions. In this study, we constructed a database named hTFtarget, which integrates large-scale human TF-target resources (7,190 ChIP-Seq samples of 659 TFs and high-confidence TF binding sites of 699 TFs) and epigenetic modification information to predict accurate TF-target regulations. hTFtarget offers the following functions for users to explore TF-target regulations: 1) Browse or search general targets of a query TF across datasets; 2) Browse TF-target regulations for a query TF in a specific dataset or tissue; 3) Search potential TFs for a given target gene or ncRNA; 4) Investigate co-association between TFs in cell lines; 5) Explore potential co-regulations for given target genes or TFs; 6) Predict candidate TFBSs on given DNA sequences; 7) View ChIP-Seq peaks for different TFs and conditions in a genome browser. hTFtarget provides a comprehensive, reliable and user-friendly resource for exploring human TF-target regulations, which will be very useful for a wide range of users in the TF and gene expression regulation community. hTFtarget is available at http://bioinfo.life.hust.edu.cn/hTFtarget.

DBSCAN-SWA: an integrated tool for rapid prophage detection and annotation

10.1101/2020.07.12.199018 ◽

2020 ◽

Author(s):

Rui Gan ◽

Fengxia Zhou ◽

Yu Si ◽

Han Yang ◽

Chuangeng Chen ◽

...

Keyword(s):

DNA Sequences ◽

Bacterial Infections ◽

Software Tool ◽

High Specificity ◽

Bacterial Genomes ◽

Bacterial Host ◽

Accurate Identification ◽

Bacterial DNA ◽

Intracellular Form ◽

User Friendly

Abstract. Summary: As an intracellular form of a bacteriophage in the bacterial host genome, a prophage is usually integrated into bacterial DNA with high specificity and contributes to horizontal gene transfer (HGT). Phage therapy has been widely applied, for example, using phages to kill bacteria to treat pathogenic and resistant bacterial infections. Therefore, it is necessary to develop effective tools for the fast and accurate identification of prophages. Here, we introduce DBSCAN-SWA, a command line software tool developed to predict prophage regions of bacterial genomes. DBSCAN-SWA runs faster than any previous tool. Importantly, it has great detection power based on analysis using 184 manually curated prophages, with a recall of 85% compared with Phage_Finder (63%), VirSorter (74%) and PHASTER (82%) for raw DNA sequences. DBSCAN-SWA also provides user-friendly visualizations including a circular prophage viewer and interactive DataTables. Availability and implementation: DBSCAN-SWA is implemented in Python3 and is freely available under an open source GPLv2 license from https://github.com/HIT-ImmunologyLab/DBSCAN-SWA/.

EnhancerAtlas 2.0: an updated resource with enhancer annotation in 586 tissue/cell types across nine species

Nucleic Acids Research ◽

10.1093/nar/gkz980 ◽

2019 ◽

Cited By ~ 8

Author(s):

Tianshun Gao ◽

Jiang Qian

Keyword(s):

Large Scale ◽

Target Gene ◽

Target Genes ◽

Cell Types ◽

Regulatory Elements ◽

Tissue Cell ◽

Normal Tissues ◽

Genome Wide ◽

Wide Range ◽

User Friendly

Abstract. Enhancers are distal cis-regulatory elements that activate the transcription of their target genes. They regulate a wide range of important biological functions and processes, including embryogenesis, development, and homeostasis. As more large-scale technologies have been developed for enhancer identification, a comprehensive database is highly desirable for enhancer annotation based on various genome-wide profiling datasets across different species. Here, we present an updated database, EnhancerAtlas 2.0 (http://www.enhanceratlas.org/indexv2.php), covering 586 tissue/cell types that include a large number of normal tissues, cancer cell lines, and cells at different developmental stages across nine species. Overall, the database contains 13 494 603 enhancers, which were obtained from 16 055 datasets using 12 high-throughput experimental methods (e.g. H3K4me1/H3K27ac, DNase-seq/ATAC-seq, P300, POLR2A, CAGE, ChIA-PET, GRO-seq, STARR-seq and MPRA). The updated version is a huge expansion of the first version, which contained only enhancers in human cells. In addition, we predicted enhancer–target gene relationships in human, mouse and fly. Finally, users can search enhancers and enhancer–target gene relationships through five user-friendly, interactive modules. We believe the new annotation of enhancers in EnhancerAtlas 2.0 will help users perform useful functional analysis of enhancers in various genomes.

StackRAM: a cross-species method for identifying RNA N6-methyladenosine sites based on stacked ensemble

10.1101/2020.04.23.058651 ◽

2020 ◽

Author(s):

Zhaomin Yu ◽

Baoguang Tian ◽

Yaning Liu ◽

Yaqun Zhang ◽

Qin Ma ◽

...

Keyword(s):

Feature Fusion ◽

Elastic Net ◽

Machine Learning Algorithms ◽

Computational Method ◽

Training Dataset ◽

Feature Subset ◽

Accurate Identification ◽

Jackknife Test ◽

Nucleotide Frequency ◽

Noisy Information

Abstract. N6-methyladenosine (m6A) is a prevalent RNA methylation modification, which plays an important role in various biological processes. Accurate identification of m6A sites is fundamental to a deep understanding of the biological functions and mechanisms of the modification. However, the experimental methods for detecting m6A sites are usually time-consuming and expensive, and various computational methods have been developed to identify m6A sites in RNA. This paper proposes a novel cross-species computational method, StackRAM, which uses machine learning algorithms to identify m6A sites in S. cerevisiae, H. sapiens and A. thaliana. First, the RNA sequence features are extracted through binary encoding, chemical property, nucleotide frequency, k-mer nucleotide frequency, pseudo dinucleotide composition, and position-specific trinucleotide propensity, and the initial feature set is obtained by feature fusion. Secondly, the Elastic Net is used for the first time to filter redundant and noisy information and retain important features for m6A site classification. Finally, the output probabilities of the base classifiers are combined with the optimal feature subset selected by the Elastic Net, and the combined features are input into the second-stage meta-classifier, an SVM. The jackknife test on the S. cerevisiae training dataset indicates that the prediction performance of StackRAM is superior to the current state-of-the-art methods. StackRAM's prediction accuracy on the independent test datasets for H. sapiens and A. thaliana reaches 92.30% and 87.06%, respectively. Therefore, StackRAM has development potential in cross-species prediction and can be a useful method for identifying m6A sites. The source code and all datasets are available at https://github.com/QUST-AIBBDRC/StackRAM/.
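
The final stacking step described above, where base-classifier probabilities are appended to the selected features before the meta-classifier sees them, can be sketched as follows. Everything named here is hypothetical: `stack_features`, the two toy base models and the selected indices stand in for the trained classifiers and the Elastic Net output, neither of which is specified in the abstract. The meta-classifier itself is not shown.

```python
def stack_features(x, selected_idx, base_models):
    """Build the second-stage input for one sample.

    x            : full feature vector of the sample
    selected_idx : indices kept by the feature selection step
    base_models  : callables mapping a feature vector to P(positive)
    """
    selected = [x[i] for i in selected_idx]
    probabilities = [model(x) for model in base_models]
    # Second-stage input = selected original features + base-model outputs.
    return selected + probabilities


# Hypothetical stand-ins for trained base classifiers (assumption).
def base_mean(x):
    return min(1.0, max(0.0, sum(x) / len(x)))


def base_threshold(x):
    return 1.0 if x[0] > 0.5 else 0.0


sample = [0.9, 0.1, 0.4, 0.6]
print(stack_features(sample, [0, 3], [base_mean, base_threshold]))
```

In practice the base-model probabilities for training samples would normally come from cross-validated predictions, so that the meta-classifier is not fitted on outputs the base models produced for data they were trained on.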

iMRM: a platform for simultaneously identifying multiple kinds of RNA modifications

Bioinformatics ◽

10.1093/bioinformatics/btaa155 ◽

2020 ◽

Vol 36 (11) ◽

pp. 3336-3342 ◽

Cited By ~ 30

Author(s):

Kewei Liu ◽

Wei Chen

Keyword(s):

Computational Methods ◽

Homo Sapiens ◽

Supplementary Information ◽

RNA Modifications ◽

Feature Selection Technique ◽

Jackknife Test ◽

The Past ◽

User Friendly ◽

Fold Cross Validation ◽

Command Line Version

Abstract. Motivation: RNA modifications play critical roles in a series of cellular and developmental processes. Knowledge about the distributions of RNA modifications in transcriptomes will provide clues to their functions. Since experimental methods for detecting RNA modifications are time-consuming and laborious, computational methods have been proposed for this purpose in the past five years. However, both experimental and computational methods have drawbacks in simultaneously identifying modifications occurring on different nucleotides. Results: To address this challenge, we developed a new predictor called iMRM, which is able to simultaneously identify m6A, m5C, m1A, ψ and A-to-I modifications in Homo sapiens, Mus musculus and Saccharomyces cerevisiae. In iMRM, a feature selection technique was used to pick out the optimal features. The results from both 10-fold cross-validation and the jackknife test demonstrated that the performance of iMRM is superior to existing methods for identifying RNA modifications. Availability and implementation: A user-friendly web server for iMRM was established at http://www.bioml.cn/XG_iRNA/home. The off-line command-line version is available at https://github.com/liukeweiaway/iMRM. Supplementary information: Supplementary data are available at Bioinformatics online.
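
The two evaluation schemes named above differ only in how samples are held out: 10-fold cross-validation holds out a tenth of the data at a time, while the jackknife test (leave-one-out) holds out a single sample. The sketch below generates the index splits for both; the function names and the interleaved fold assignment are choices made for illustration, and a real pipeline would usually shuffle the samples first.

```python
def kfold_indices(n, k):
    """Yield (train, test) index lists for k-fold cross-validation."""
    # Sample i goes to fold i % k, so fold sizes differ by at most one.
    folds = [list(range(i, n, k)) for i in range(k)]
    for test in folds:
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        yield train, test


def jackknife_indices(n):
    """Jackknife (leave-one-out) is k-fold with k equal to n."""
    return kfold_indices(n, n)


for train, test in jackknife_indices(4):
    print(train, test)
# [1, 2, 3] [0]
# [0, 2, 3] [1]
# [0, 1, 3] [2]
# [0, 1, 2] [3]
```

The jackknife trains n models for n samples, which is why it is usually reserved for small benchmark sets, whereas 10-fold cross-validation trains only ten.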

iT4SE-EP: Accurate Identification of Bacterial Type IV Secreted Effectors by Exploring Evolutionary Features from Two PSI-BLAST Profiles

Molecules ◽

10.3390/molecules26092487 ◽

2021 ◽

Vol 26 (9) ◽

pp. 2487

Author(s):

Haitao Han ◽

Chenchen Ding ◽

Xin Cheng ◽

Xiuzhi Sang ◽

Taigang Liu

Keyword(s):

Evolutionary Information ◽

Support Vector ◽

Target Cells ◽

Secretion Systems ◽

Accurate Identification ◽

Feature Selection Technique ◽

Type IV ◽

Evolutionary Features ◽

Wide Range ◽

Encoding Strategies

Many gram-negative bacteria use type IV secretion systems to deliver effector molecules to a wide range of target cells. These substrate proteins, which are called type IV secreted effectors (T4SEs), manipulate host cell processes during infection, often resulting in severe disease or even death of the host. Therefore, identification of putative T4SEs has become a very active research topic in bioinformatics owing to its vital role in understanding host-pathogen interactions. PSI-BLAST profiles have been experimentally validated to provide important and discriminatory evolutionary information for various protein classification tasks. In the present study, an accurate computational predictor termed iT4SE-EP was developed for identifying T4SEs by extracting evolutionary features from the position-specific scoring matrix and the position-specific frequency matrix profiles. First, four types of encoding strategies were designed to transform protein sequences into fixed-length feature vectors based on the two profiles. Then, a feature selection technique based on the random forest algorithm was utilized to remove redundant or irrelevant features without much loss of information. Finally, the optimal features were input into a support vector machine classifier to carry out the prediction of T4SEs. The experimental results demonstrated that iT4SE-EP outperformed most existing methods on the independent dataset test.
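
A PSI-BLAST profile has one row per residue, so proteins of different lengths give matrices of different sizes, and some encoding must map them to a fixed-length vector. The abstract does not name its four encoding strategies; the sketch below shows only the simplest generic option, averaging each column over the sequence, and is not claimed to be one of them. The function name and the toy three-column matrix are assumptions for illustration (a real profile has 20 columns).

```python
def profile_column_means(profile):
    """Collapse an L x C profile matrix into C column averages.

    profile : list of L rows, one per residue, each with C scores.
    The output length depends only on C, not on the protein length L.
    """
    if not profile:
        raise ValueError("profile must contain at least one row")
    length = len(profile)
    n_cols = len(profile[0])
    return [sum(row[j] for row in profile) / length for j in range(n_cols)]


# Toy matrix with 2 residues and 3 columns (assumption).
print(profile_column_means([[1, 2, 3], [3, 4, 5]]))  # [2.0, 3.0, 4.0]
```

Averaging discards positional order, which is why richer encodings are often used alongside it, but it shows how two profiles of different lengths can end up in the same feature space.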
