i6mA-DNCP: Computational Identification of DNA N6-Methyladenine Sites in the Rice Genome Using Optimized Dinucleotide-Based Features

Genes, 2019, Vol 10 (10), pp. 828
Author(s): Liang Kong, Lichao Zhang

DNA N6-methyladenine (6mA) plays an important role in regulating the gene expression of eukaryotes. Accurate identification of 6mA sites may assist in understanding genomic 6mA distributions and biological functions. Various experimental methods have been applied to detect 6mA sites on a genome-wide scale, but they are time-consuming and expensive. Computational methods that rapidly identify 6mA sites are therefore needed. In this paper, a new machine learning-based method, i6mA-DNCP, was proposed for identifying 6mA sites in the rice genome. Dinucleotide composition and dinucleotide-based DNA properties were first employed to represent DNA sequences. After a specially designed DNA property selection process, a bagging classifier was used to build the prediction model. The jackknife test on a benchmark dataset demonstrated that i6mA-DNCP could obtain 84.43% sensitivity, 88.86% specificity, 86.65% accuracy, a 0.734 Matthews correlation coefficient (MCC), and a 0.926 area under the receiver operating characteristic curve (AUC). Moreover, three independent datasets were established to assess the generalization ability of our method. Extensive experiments validated the effectiveness of i6mA-DNCP.
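As an illustration of the two ideas this abstract relies on, the sketch below computes a 16-dimensional dinucleotide composition vector and the MCC from confusion counts. It is a minimal sketch, not the authors' implementation: the function names are hypothetical, overlapping dinucleotides are assumed, and the MCC check assumes equally sized positive and negative sets, which the abstract does not state.

```python
from itertools import product
from math import sqrt

# All 16 dinucleotides in a fixed lexicographic order: AA, AC, ..., TT.
DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]


def dinucleotide_composition(seq: str) -> list:
    """Return normalized frequencies of overlapping dinucleotides (assumed scheme)."""
    seq = seq.upper()
    counts = dict.fromkeys(DINUCLEOTIDES, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # pairs with N or other symbols are skipped
            counts[pair] += 1
    total = sum(counts.values())
    return [counts[d] / total if total else 0.0 for d in DINUCLEOTIDES]


def mcc(tp: float, fp: float, tn: float, fn: float) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Reported sensitivity and specificity, applied to a hypothetical balanced
# dataset (one unit of positives, one unit of negatives).
sn, sp = 0.8443, 0.8886
balanced_mcc = mcc(tp=sn, fp=1 - sp, tn=sp, fn=1 - sn)
```

Under the balanced-set assumption, `balanced_mcc` is close to the reported 0.734, which suggests the reported metrics are mutually consistent.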

2019, Vol 35 (16), pp. 2796-2800
Author(s): Wei Chen, Hao Lv, Fulei Nie, Hao Lin

Motivation: DNA N6-methyladenine (6mA) is associated with a wide range of biological processes. Since the distribution of 6mA sites in the genome is non-random, accurate identification of 6mA sites is crucial for understanding its biological functions. Although experimental methods have been proposed for this purpose, they are still not cost-effective for detecting 6mA sites on a genome-wide scale. Therefore, it is desirable to develop computational methods to facilitate the identification of 6mA sites. Results: In this study, a computational method called i6mA-Pred was developed to identify 6mA sites in the rice genome, in which the optimal nucleotide chemical properties obtained by using a feature selection technique were used to encode the DNA sequences. i6mA-Pred yielded an accuracy of 83.13% in the jackknife test. Meanwhile, the performance of i6mA-Pred was also superior to other methods. Availability and implementation: A user-friendly web server, i6mA-Pred, is freely accessible at http://lin-group.cn/server/i6mA-Pred.
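To make "nucleotide chemical properties" concrete, the sketch below encodes each base by three binary properties (ring structure, functional group, hydrogen-bond strength) and optionally appends a running nucleotide density. The mapping is a commonly used convention and is an assumption here; the abstract does not give the exact encoding or the selected feature subset, and the function names are hypothetical.

```python
# (purine?, amino group?, weak hydrogen bonding?) per base.
# Assumed convention: A=(1,1,1), C=(0,1,0), G=(1,0,0), T=(0,0,1).
NCP = {
    "A": (1, 1, 1),
    "C": (0, 1, 0),
    "G": (1, 0, 0),
    "T": (0, 0, 1),
}


def encode_ncp(seq: str) -> list:
    """Flatten a DNA sequence into 3 chemical-property bits per position."""
    features = []
    for base in seq.upper():
        features.extend(NCP.get(base, (0, 0, 0)))  # unknown bases become zeros
    return features


def encode_ncp_with_density(seq: str) -> list:
    """Append, per position, the running frequency of that base so far."""
    seq = seq.upper()
    seen = {}
    features = []
    for i, base in enumerate(seq):
        seen[base] = seen.get(base, 0) + 1
        density = seen[base] / (i + 1)
        features.extend(NCP.get(base, (0, 0, 0)))
        features.append(density)
    return features
```

A sequence of length n becomes a vector of length 3n (or 4n with density), which a feature selection step could then prune.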


2021, Vol 9 (8), pp. 1570
Author(s): Chien-Hsun Huang, Chih-Chieh Chen, Yu-Chun Lin, Chia-Hsuan Chen, Ai-Yun Lee, ...

The current taxonomy of the Lactiplantibacillus plantarum group comprises 17 closely related species that are indistinguishable from each other by commonly used 16S rRNA gene sequencing. In this study, a whole-genome-based analysis was carried out to explore highly discriminating target genes whose interspecific sequence identity is significantly lower than that of 16S rRNA or conventional housekeeping genes. In silico analyses of 774 core genes by the cano-wgMLST_BacCompare analytics platform indicated that the csbB, morA, murI, mutL, ntpJ, rutB, trmK, ydaF, and yhhX genes were the most promising candidates. Subsequently, the mutL gene was selected, and its discrimination power was further evaluated using Sanger sequencing. Among the type strains, mutL exhibited clearly lower, and therefore more discriminating, sequence identity (61.6–85.6%; average: 66.6%) than the 16S rRNA gene (96.7–100%; average: 98.4%) and the conventional phylogenetic marker genes (e.g., dnaJ, dnaK, pheS, recA, and rpoA), and could be used to separate the tested strains into various species clusters. Consequently, species-specific primers were developed for fast and accurate identification of L. pentosus, L. argentoratensis, L. plantarum, and L. paraplantarum. During this study, one strain (BCRC 06B0048, L. pentosus) exhibited not only a relatively low mutL sequence identity (97.0%) but also a low digital DNA–DNA hybridization value (78.1%) with the type strain DSM 20314T, signifying that it exhibits potential for reclassification as a novel subspecies. Our data demonstrate that mutL can be a genome-wide target for identifying and classifying the L. plantarum group species and for differentiating novel taxa from known species.
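The identity percentages quoted above compare aligned gene sequences. A minimal sketch of such a comparison follows; it assumes the two sequences are already aligned to equal length and that columns containing a gap are ignored. Identity definitions differ between tools, so this is one plausible convention rather than the one used in the study, and the function name is hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over gap-free columns of two pre-aligned sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == "-" or y == "-":
            continue  # assumed convention: skip columns with a gap
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    return 100.0 * matches / compared
```

For example, `percent_identity("ACGT", "ACGA")` compares four columns, of which three match.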


Author(s): Yanrong Ji, Zhihan Zhou, Han Liu, Ramana V Davuluri

Motivation: Deciphering the language of non-coding DNA is one of the fundamental problems in genome research. The gene regulatory code is highly complex due to the existence of polysemy and distant semantic relationships, which previous informatics methods often fail to capture, especially in data-scarce scenarios. Results: To address this challenge, we developed a novel pre-trained bidirectional encoder representation, named DNABERT, to capture a global and transferable understanding of genomic DNA sequences based on upstream and downstream nucleotide contexts. We compared DNABERT to the most widely used programs for genome-wide regulatory element prediction and demonstrate its ease of use, accuracy and efficiency. We show that the single pre-trained transformer model can simultaneously achieve state-of-the-art performance on prediction of promoters, splice sites and transcription factor binding sites, after easy fine-tuning using small task-specific labeled data. Further, DNABERT enables direct visualization of nucleotide-level importance and semantic relationships within input sequences for better interpretability and accurate identification of conserved sequence motifs and functional genetic variant candidates. Finally, we demonstrate that DNABERT pre-trained on the human genome can even be readily applied to other organisms with exceptional performance. We anticipate that the pre-trained DNABERT model can be fine-tuned for many other sequence analysis tasks. Availability and implementation: The source code and the pretrained and fine-tuned models for DNABERT are available at GitHub (https://github.com/jerryji1993/DNABERT). Supplementary information: Supplementary data are available at Bioinformatics online.
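The abstract does not say how DNA is turned into tokens for the encoder. Overlapping k-mers are a common choice for DNA language models, and the sketch below assumes that scheme with k = 6 as a default; the function names and the special tokens are illustrative, not taken from the DNABERT code.

```python
def kmer_tokens(seq: str, k: int = 6) -> list:
    """Split a DNA sequence into overlapping k-mers with stride 1 (assumed scheme)."""
    if k <= 0:
        raise ValueError("k must be positive")
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_vocab(tokens: list) -> dict:
    """Map each distinct token to an integer id, reserving ids for special tokens."""
    vocab = {"[PAD]": 0, "[UNK]": 1}  # hypothetical special tokens
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab
```

With stride 1, a sequence of length n yields n - k + 1 tokens, so neighbouring tokens share k - 1 nucleotides of context.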


Genes, 2021, Vol 12 (3), pp. 354
Author(s): Lu Zhang, Xinyi Qin, Min Liu, Ziwei Xu, Guangzhong Liu

As a prevalent post-transcriptional modification of RNA, N6-methyladenosine (m6A) plays a crucial role in various biological processes. To better reveal its regulatory mechanism and provide new insights for drug design, accurate genome-wide identification of m6A sites is vital. As the traditional experimental methods are time-consuming and cost-prohibitive, it is necessary to design a more efficient computational method to detect m6A sites. In this study, we propose a novel cross-species computational method, DNN-m6A, based on a deep neural network (DNN), to identify m6A sites in multiple tissues of human, mouse and rat. Firstly, binary encoding (BE), tri-nucleotide composition (TNC), enhanced nucleic acid composition (ENAC), K-spaced nucleotide pair frequencies (KSNPFs), nucleotide chemical property (NCP), pseudo dinucleotide composition (PseDNC), position-specific nucleotide propensity (PSNP) and position-specific dinucleotide propensity (PSDP) are employed to extract RNA sequence features, which are subsequently fused to construct the initial feature vector set. Secondly, we use elastic net to eliminate redundant features while building the optimal feature subset. Finally, the hyper-parameters of the DNN are tuned with Bayesian hyper-parameter optimization based on the selected feature subset. The five-fold cross-validation test on the training datasets shows that the proposed DNN-m6A method outperformed the state-of-the-art method for predicting m6A sites, with an accuracy (ACC) of 73.58%–83.38% and an area under the curve (AUC) of 81.39%–91.04%. Furthermore, on the independent datasets the method achieved an ACC of 72.95%–83.04% and an AUC of 80.79%–91.09%, which shows the excellent generalization ability of our proposed method.
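Two of the encodings listed above, binary encoding and K-spaced nucleotide pair frequencies, can be sketched as follows. This is a minimal sketch under assumptions: the ACGU alphabet order, the conversion of T to U, and the normalization by the number of available pairs are choices made here for illustration, and the function names are hypothetical.

```python
from itertools import product

ALPHABET = "ACGU"
PAIRS = [a + b for a, b in product(ALPHABET, repeat=2)]  # AA, AC, ..., UU


def _normalize(seq: str) -> str:
    """Upper-case the sequence and treat DNA-style T as RNA U (assumption)."""
    return seq.upper().replace("T", "U")


def binary_encoding(seq: str) -> list:
    """One-hot encode each position over ACGU; unknown symbols become all zeros."""
    vec = []
    for base in _normalize(seq):
        vec.extend(1 if base == symbol else 0 for symbol in ALPHABET)
    return vec


def k_spaced_pair_freq(seq: str, k: int) -> list:
    """Frequencies of nucleotide pairs separated by exactly k other nucleotides."""
    seq = _normalize(seq)
    n_pairs = len(seq) - k - 1
    counts = dict.fromkeys(PAIRS, 0)
    for i in range(max(n_pairs, 0)):
        pair = seq[i] + seq[i + k + 1]
        if pair in counts:
            counts[pair] += 1
    return [counts[p] / n_pairs if n_pairs > 0 else 0.0 for p in PAIRS]
```

Concatenating `k_spaced_pair_freq(seq, k)` for several values of k gives 16 features per gap size, which is the kind of redundant block a later feature selection step would thin out.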


2018, Vol 19 (10), pp. 3145
Author(s): Jie Yu, Weiguo Zhao, Wei Tong, Qiang He, Min-Young Yoon, ...

Salt toxicity is the major factor limiting crop productivity in saline soils. In this paper, 295 accessions including a heuristic core set (137 accessions) and 158 bred varieties were re-sequenced and ~1.65 million SNPs/indels were used to perform a genome-wide association study (GWAS) of salt-tolerance-related phenotypes in rice during the germination stage. Using a compressed mixed linear model, we detected a total of 12 associated peaks distributed on seven chromosomes. Based on linkage disequilibrium (LD) block analysis, we obtained a total of 79 candidate genes. By detecting the highly associated variations located inside the genic region that overlapped with the results of LD block analysis, we characterized 17 genes that may contribute to salt tolerance during the seed germination stage. At the same time, we conducted a haplotype analysis of the genes with functional variations together with phenotypic correlation and orthologous sequence analyses. Among these genes, OsMADS31, which is a MADS-box family transcription factor, had down-regulated expression under the salt condition and was predicted to be involved in salt tolerance at the rice germination stage. Our study revealed some novel candidate genes and their substantial natural variations in the rice genome at the germination stage. The GWAS in rice at the germination stage would provide important resources for molecular breeding and functional analysis of salt tolerance during rice germination.


2019, Vol 70 (15), pp. 3867-3879
Author(s): Anneke Frerichs, Julia Engelhorn, Janine Altmüller, Jose Gutierrez-Marcos, Wolfgang Werr

Fluorescence-activated cell sorting (FACS) and assay for transposase-accessible chromatin with high-throughput sequencing (ATAC-seq) were combined to analyse the chromatin state of lateral organ founder cells (LOFCs) in the peripheral zone of the Arabidopsis apetala1-1 cauliflower-1 double mutant inflorescence meristem. On a genome-wide level, we observed a striking correlation between transposase hypersensitive sites (THSs) detected by ATAC-seq and DNase I hypersensitive sites (DHSs). The mostly expanded DHSs were often substructured into several individual THSs, which correlated with phylogenetically conserved DNA sequences or enhancer elements. Comparing chromatin accessibility with available RNA-seq data, THS change configuration was reflected by gene activation or repression and chromatin regions acquired or lost transposase accessibility in direct correlation with gene expression levels in LOFCs. This was most pronounced immediately upstream of the transcription start, where genome-wide THSs were abundant in a complementary pattern to established H3K4me3 activation or H3K27me3 repression marks. At this resolution, the combined application of FACS/ATAC-seq is widely applicable to detect chromatin changes during cell-type specification and facilitates the detection of regulatory elements in plant promoters.


2014, Vol 2014, pp. 1-7
Author(s): Zhi-guo E, Lei Wang, Ryan Qin, Haihong Shen, Jianhua Zhou

Rice growth is greatly affected by temperature. To examine how temperature influences gene expression in rice on a genome-wide basis, we utilised recently compiled next-generation sequencing datasets and characterised a number of RNA-sequence transcriptome samples in rice seedling leaf blades at 25°C and 30°C. Our analysis indicated that 50.4% of all genes in the rice genome (28,296/56,143) were expressed in rice samples grown at 25°C, whereas slightly fewer genes (50.2%; 28,189/56,143) were expressed in rice leaf blades grown at 30°C. Among the genes that were expressed, approximately 3% were highly expressed, whereas approximately 65% had low levels of expression. Further examination demonstrated that 821 genes had a twofold or higher increase in expression and that 553 genes had a twofold or greater decrease in expression at 25°C. Gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) analyses suggested that the ribosome pathway and multiple metabolic pathways were upregulated at 25°C. Based on these results, we deduced that gene expression at both transcriptional and translational levels was stimulated at 25°C, perhaps in response to a suboptimal temperature condition. Finally, we observed that temperature markedly regulates several super-families of transcription factors, including bZIP, MYB, and WRKY.
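The percentages and the twofold criterion in this abstract are simple arithmetic, sketched below. The gene counts are taken from the abstract; the `classify_fold_change` helper, its pseudocount, and the example expression values are hypothetical and only illustrate how a twofold threshold could be applied.

```python
TOTAL_GENES = 56_143
EXPRESSED_AT_25C = 28_296
EXPRESSED_AT_30C = 28_189


def percent(part: int, whole: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


def classify_fold_change(expr_a: float, expr_b: float,
                         threshold: float = 2.0,
                         pseudocount: float = 1.0) -> str:
    """Label a gene 'up', 'down' or 'unchanged' in condition A relative to B.

    The pseudocount (an assumption) avoids division by zero for unexpressed genes.
    """
    ratio = (expr_a + pseudocount) / (expr_b + pseudocount)
    if ratio >= threshold:
        return "up"
    if ratio <= 1 / threshold:
        return "down"
    return "unchanged"


pct_25 = percent(EXPRESSED_AT_25C, TOTAL_GENES)
pct_30 = percent(EXPRESSED_AT_30C, TOTAL_GENES)
```

Both computed percentages agree with the figures quoted in the abstract (50.4% and 50.2%).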


Entropy, 2019, Vol 21 (8), pp. 802
Author(s): Chun-xiao Sun, Yu Yang, Hua Wang, Wen-hu Wang

Chromatin immunoprecipitation combined with next-generation sequencing (ChIP-Seq) technology has enabled the identification of transcription factor binding sites (TFBSs) on a genome-wide scale. To effectively and efficiently discover TFBSs in the thousand or more DNA sequences generated by a ChIP-Seq data set, we propose a new algorithm named AP-ChIP. First, we set two thresholds based on probabilistic analysis to construct and further filter the cluster subsets. Then, we use Affinity Propagation (AP) clustering on the candidate cluster subsets to find the potential motifs. Experimental results on simulated data show that the AP-ChIP algorithm is able to make an almost accurate prediction of TFBSs in a reasonable time. Also, the validity of the AP-ChIP algorithm is tested on a real ChIP-Seq data set.


2014, Vol 32 (4_suppl), pp. 464-464
Author(s): Thai Huu Ho, Jeong-Heon Lee, Rafael Nunez Nateras, Erik P. Castle, Melissa L. Stanton, ...

Background: Although the von Hippel-Lindau (VHL) tumor suppressor gene is mutated in 60% of ccRCC, deletion of VHL in mice is insufficient for tumorigenesis. Sequencing of ccRCC tumors identified mutations in SETD2, a histone H3 lysine 36 (H3K36) trimethyltransferase. We hypothesize that loss of SETD2 methyltransferase activity alters the genome-wide pattern of H3K36 trimethylation (H3K36me3) in ccRCC, and contributes to the cancer phenotype. Methods: To generate a genome-wide profile of H3K36me3 in frozen nephrectomy samples and RCC cell lines, we optimized a chromatin immunoprecipitation (ChIP) protocol for the isolation of DNA associated with H3K36me3. H3K36me3 is associated with open chromatin, and an H3K36me3-specific antibody was used for immunoprecipitation of endogenous H3K36me3-bound DNA. ChIP PCR primers were optimized for active genes, such as actin and glyceraldehyde-3-phosphate dehydrogenase (GAPDH), and for a "gene desert" on chromosome 12 (negative control). ChIP libraries were then generated from 3 paired uninvolved kidney and RCC samples and 2 RCC cell lines. To identify H3K36me3 upregulated regions in uninvolved kidney and RCC, reads from the ChIP sequencing were mapped to the human genome using the Burrows-Wheeler Aligner and SICER algorithms. Results: Using ChIP PCR, we found that active genomic regions were enriched 15-30 fold over the negative controls, indicating that the quality and yield of immunoprecipitated DNA/chromatin complexes from frozen tissue were sufficient for ChIP sequencing. A preliminary ChIP sequencing analysis of RCC cell lines and frozen ccRCC tissue indicates that H3K36me3-enriched DNA sequences mapped more often to exons (31.3%) than to introns (13.5%, p<0.001), consistent with the role of H3K36me3 in transcription. Conclusions: Genomic regions enriched for H3K36me3 binding were identified from patient-derived tissue and RCC cell lines. Current efforts are focused on comparing the H3K36me3 profiles between matched tumor and uninvolved kidney ChIP libraries to generate a genome-wide map of dysregulated H3K36me3 modifications.


2015, Vol 2 (9), pp. 150156
Author(s): Georgia Tsagkogeorga, Michael R. McGowen, Kalina T. J. Davies, Simon Jarman, Andrea Polanowski, ...

Recent studies have reported multiple cases of molecular adaptation in cetaceans related to their aquatic abilities. However, none of these has included the hippopotamus, precluding an understanding of whether molecular adaptations in cetaceans occurred before or after they split from their semi-aquatic sister taxa. Here, we obtained new transcriptomes from the hippopotamus and humpback whale, and analysed these together with available data from eight other cetaceans. We identified more than 11 000 orthologous genes and compiled a genome-wide dataset of 6845 coding DNA sequences among 23 mammals, to our knowledge the largest phylogenomic dataset to date for cetaceans. We found positive selection in nine genes on the branch leading to the common ancestor of hippopotamus and whales, and 461 genes in cetaceans compared to 64 in hippopotamus. Functional annotation revealed adaptations in diverse processes, including lipid metabolism, hypoxia, muscle and brain function. By combining these findings with data on protein–protein interactions, we found evidence suggesting clustering among gene products relating to nervous and muscular systems in cetaceans. We found little support for shared ancestral adaptations in the two taxa; most molecular adaptations in extant cetaceans occurred after their split with hippopotamids.

