i4mC-EL: Identifying DNA N4-Methylcytosine Sites in the Mouse Genome Using Ensemble Learning

2021 ◽  
Vol 2021 ◽  
pp. 1-11
Author(s):  
Yanjuan Li ◽  
Zhengnan Zhao ◽  
Zhixia Teng

As one of the important epigenetic modifications, DNA N4-methylcytosine (4mC) plays a crucial role in controlling gene replication and expression, the cell cycle, DNA replication, and differentiation. Accurate identification of 4mC sites is necessary to understand their biological functions. In this paper, we use ensemble learning to develop a model named i4mC-EL to identify 4mC sites in the mouse genome. First, a multifeature encoding scheme consisting of Kmer and EIIP was adopted to describe the DNA sequences. Second, on the basis of this encoding scheme, we developed a stacked ensemble model in which four machine learning algorithms, namely BayesNet, NaiveBayes, LibSVM, and Voted Perceptron, serve as base classifiers whose intermediate outputs are the input to the metaclassifier, Logistic. The experimental results on the independent test dataset show that the overall predictive accuracy of i4mC-EL is 82.19%, which is better than that of existing methods. The user-friendly website implementing i4mC-EL can be accessed freely at the following.
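To make the Kmer and EIIP encodings mentioned in the abstract above more concrete, here is a minimal Python sketch. It assumes the commonly cited electron-ion interaction pseudopotential (EIIP) values for the four nucleotides and a plain normalized k-mer frequency vector. The function names, the choice of `k`, and the way the two vectors are concatenated are illustrative assumptions and are not taken from the i4mC-EL implementation.

```python
from itertools import product

# Commonly cited EIIP values per nucleotide (assumed here, not quoted from the paper).
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}


def kmer_frequencies(seq, k=2):
    """Return normalized counts of all 4**k possible k-mers, in lexicographic order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = len(seq) - k + 1  # number of overlapping windows
    if total <= 0:
        return [0.0] * len(kmers)
    for i in range(total):
        window = seq[i:i + k]
        if window in counts:  # windows with ambiguous bases (e.g. N) are skipped
            counts[window] += 1
    return [counts[m] / total for m in kmers]


def eiip_encoding(seq):
    """Map each base to its EIIP value; unknown bases map to 0.0."""
    return [EIIP.get(base, 0.0) for base in seq]


def encode(seq, k=2):
    """Hypothetical multifeature vector: k-mer frequencies followed by per-base EIIP values."""
    seq = seq.upper()
    return kmer_frequencies(seq, k) + eiip_encoding(seq)
```

For a sequence of length n, `encode` returns 4**k + n values, so fixed-length input windows yield fixed-length feature vectors that a classifier can consume.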

2020 ◽  
Author(s):  
Rui Gan ◽  
Fengxia Zhou ◽  
Yu Si ◽  
Han Yang ◽  
Chuangeng Chen ◽  
...  

Abstract Summary As an intracellular form of a bacteriophage in the bacterial host genome, a prophage is usually integrated into bacterial DNA with high specificity and contributes to horizontal gene transfer (HGT). Phage therapy has been widely applied, for example, using phages to kill bacteria to treat pathogenic and resistant bacterial infections. Therefore, it is necessary to develop effective tools for the fast and accurate identification of prophages. Here, we introduce DBSCAN-SWA, a command line software tool developed to predict prophage regions of bacterial genomes. DBSCAN-SWA runs faster than any previous tool. Importantly, it has great detection power based on analysis using 184 manually curated prophages, with a recall of 85% compared with Phage_Finder (63%), VirSorter (74%) and PHASTER (82%) for raw DNA sequences. DBSCAN-SWA also provides user-friendly visualizations including a circular prophage viewer and interactive DataTables. Availability and implementation DBSCAN-SWA is implemented in Python3 and is freely available under an open source GPLv2 license from https://github.com/HIT-ImmunologyLab/DBSCAN-SWA/.


2019 ◽  
Vol 35 (16) ◽  
pp. 2796-2800 ◽  
Author(s):  
Wei Chen ◽  
Hao Lv ◽  
Fulei Nie ◽  
Hao Lin

Abstract Motivation DNA N6-methyladenine (6mA) is associated with a wide range of biological processes. Since the distribution of 6mA sites in the genome is non-random, accurate identification of 6mA sites is crucial for understanding their biological functions. Although experimental methods have been proposed for this purpose, they are still cost-ineffective for detecting 6mA sites on a genome-wide scale. Therefore, it is desirable to develop computational methods to facilitate the identification of 6mA sites. Results In this study, a computational method called i6mA-Pred was developed to identify 6mA sites in the rice genome, in which the optimal nucleotide chemical properties obtained by using a feature selection technique were used to encode the DNA sequences. i6mA-Pred yielded an accuracy of 83.13% in the jackknife test. Meanwhile, the performance of i6mA-Pred was also superior to that of other methods. Availability and implementation A user-friendly web-server, i6mA-Pred, is freely accessible at http://lin-group.cn/server/i6mA-Pred.
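As an illustration of encoding DNA by nucleotide chemical properties, the following Python sketch maps each base to three binary indicators: ring structure (purine or pyrimidine), functional group (amino or keto), and hydrogen-bond strength (weak or strong). This is the widely used three-property scheme. Whether i6mA-Pred uses exactly this layout, and which positions survive its feature selection, is not stated in the abstract, so the function name and output format are assumptions.

```python
# Three binary chemical properties per nucleotide.
PURINES = set("AG")      # ring structure: purine (1) vs pyrimidine (0)
AMINO = set("AC")        # functional group: amino (1) vs keto (0)
WEAK_H_BOND = set("AT")  # hydrogen bonding: weak (1) vs strong (0)


def ncp_encode(seq):
    """Return a flat list with three 0/1 values per base; unknown bases give 0, 0, 0."""
    features = []
    for base in seq.upper():
        features.extend([
            int(base in PURINES),
            int(base in AMINO),
            int(base in WEAK_H_BOND),
        ])
    return features
```

Under this scheme A, C, G and T become (1, 1, 1), (0, 1, 0), (1, 0, 0) and (0, 0, 1), so every base receives a distinct code.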


Cells ◽  
2019 ◽  
Vol 8 (11) ◽  
pp. 1332 ◽  
Author(s):  
Manavalan ◽  
Basith ◽  
Shin ◽  
Lee ◽  
Wei ◽  
...  

DNA N4-methylcytosine (4mC) is one of the key epigenetic alterations, playing essential roles in DNA replication, differentiation, cell cycle, and gene expression. To better understand the biological functions of 4mC, it is crucial to gain knowledge of its genomic distribution. Recently, a few computational studies, in particular machine learning (ML) approaches, have been applied to the prediction of 4mC sites. Although ML-based methods are promising for 4mC identification in other species, none are available for detecting 4mCs in the mouse genome. Our novel computational approach, called 4mCpred-EL, is the first method for identifying 4mC sites in the mouse genome, in which four different ML algorithms are combined with seven feature encodings. The probabilistic values predicted from those feature encodings are then used as a feature vector and fed once again into ML algorithms, whose corresponding models are integrated by ensemble learning. Our benchmarking results demonstrated that 4mCpred-EL achieved accuracy and MCC values of 0.795 and 0.591, significantly outperforming seven other classifiers by 1.5–5.9% and 3.2–11.7%, respectively. Additionally, 4mCpred-EL attained an overall accuracy of 79.80% in the independent evaluation, which is 1.8–5.1% higher than that yielded by seven other classifiers. We provide a user-friendly web server, 4mCpred-EL, which can be used as a pre-screening tool for the identification of potential 4mC sites in the mouse genome.
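The abstract above reports accuracy and the Matthews correlation coefficient (MCC). As a reminder of how these two metrics are derived from a binary confusion matrix, here is a short Python sketch. The function names and the example counts are purely illustrative and are not results from the paper.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical confusion matrix for illustration only.
example = dict(tp=80, tn=80, fp=20, fn=20)
example_accuracy = accuracy(**example)
example_mcc = mcc(**example)
```

Unlike accuracy, MCC uses all four cells of the confusion matrix symmetrically, which is why it is often reported alongside accuracy for site-prediction tasks.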


2019 ◽  
Vol 20 (5) ◽  
pp. 565-578 ◽  
Author(s):  
Lidong Wang ◽  
Ruijun Zhang

Ubiquitination is an important post-translational modification (PTM) process for the regulation of protein functions, and it is associated with cancer, cardiovascular and other diseases. Recent initiatives have focused on the detection of potential ubiquitination sites with the aid of physicochemical test approaches in conjunction with computational methods. The identification of ubiquitination sites using laboratory tests is especially susceptible to the temporality and reversibility of the ubiquitination processes, and is also costly and time-consuming. It has been demonstrated that computational methods are effective in extracting potential rules or inferences from biological sequence collections. To date, computational strategies have been among the critical research approaches applied to the identification of ubiquitination sites, and numerous state-of-the-art computational methods based on machine learning and statistical analysis have been developed for this task. In the present study, the construction of benchmark datasets is summarized, together with the feature representation methods, feature selection approaches and classifiers used in several previous publications. To explore pertinent development trends for the identification of ubiquitination sites, an independent test dataset was constructed, and the prediction results obtained from five prediction tools are reported here, together with some related discussion.


2020 ◽  
pp. bjophthalmol-2020-317825
Author(s):  
Yonghao Li ◽  
Weibo Feng ◽  
Xiujuan Zhao ◽  
Bingqian Liu ◽  
Yan Zhang ◽  
...  

Background/aims To apply deep learning technology to develop an artificial intelligence (AI) system that can identify vision-threatening conditions in high myopia patients based on optical coherence tomography (OCT) macular images. Methods In this cross-sectional, prospective study, a total of 5505 qualified OCT macular images obtained from 1048 high myopia patients admitted to Zhongshan Ophthalmic Centre (ZOC) from 2012 to 2017 were selected for the development of the AI system. The independent test dataset included 412 images obtained from 91 high myopia patients recruited at ZOC from January 2019 to May 2019. We adopted the InceptionResnetV2 architecture to train four independent convolutional neural network (CNN) models to identify the following four vision-threatening conditions in high myopia: retinoschisis, macular hole, retinal detachment and pathological myopic choroidal neovascularisation. Focal Loss was used to address class imbalance, and optimal operating thresholds were determined according to the Youden Index. Results In the independent test dataset, the areas under the receiver operating characteristic curves were high for all conditions (0.961 to 0.999). Our AI system achieved sensitivities equal to or even better than those of retina specialists as well as high specificities (greater than 90%). Moreover, our AI system provided a transparent and interpretable diagnosis with heatmaps. Conclusions We used OCT macular images for the development of CNN models to identify vision-threatening conditions in high myopia patients. Our models achieved reliable sensitivities and high specificities, comparable to those of retina specialists, and may be applied for large-scale high myopia screening and patient follow-up.
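Two techniques named in the abstract above, Focal Loss and threshold selection by the Youden Index, can be sketched in a few lines of Python. This is a minimal illustration, not the study's code: the defaults `gamma=2.0` and `alpha=0.25` are common choices from the focal loss literature and are assumptions here, as are the function names. The threshold search assumes both classes are present in `labels` and that probabilities lie strictly between 0 and 1.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one example: -alpha_t * (1 - p_t)**gamma * log(p_t).

    p is the predicted probability of the positive class (0 < p < 1), y is 0 or 1.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def youden_threshold(scores, labels):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1.

    A sample is predicted positive when its score is >= threshold.
    Ties are resolved in favor of the lowest threshold.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        j = tp / pos + tn / neg - 1.0
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j
```

The `(1 - p_t)**gamma` factor shrinks the loss of well-classified examples, so rare, hard cases contribute relatively more to training, while the Youden Index picks the operating point that balances sensitivity and specificity.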


Author(s):  
Yanrong Ji ◽  
Zhihan Zhou ◽  
Han Liu ◽  
Ramana V Davuluri

Abstract Motivation Deciphering the language of non-coding DNA is one of the fundamental problems in genome research. Gene regulatory code is highly complex due to the existence of polysemy and distant semantic relationships, which previous informatics methods often fail to capture, especially in data-scarce scenarios. Results To address this challenge, we developed a novel pre-trained bidirectional encoder representation, named DNABERT, to capture a global and transferrable understanding of genomic DNA sequences based on upstream and downstream nucleotide contexts. We compared DNABERT to the most widely used programs for genome-wide regulatory element prediction and demonstrate its ease of use, accuracy and efficiency. We show that a single pre-trained transformer model can simultaneously achieve state-of-the-art performance on the prediction of promoters, splice sites and transcription factor binding sites, after easy fine-tuning using small task-specific labeled data. Further, DNABERT enables direct visualization of nucleotide-level importance and semantic relationships within input sequences for better interpretability and accurate identification of conserved sequence motifs and functional genetic variant candidates. Finally, we demonstrate that DNABERT pre-trained on the human genome can even be readily applied to other organisms with exceptional performance. We anticipate that the pre-trained DNABERT model can be fine-tuned for many other sequence analysis tasks. Availability and implementation The source code and the pretrained and finetuned models for DNABERT are available at GitHub (https://github.com/jerryji1993/DNABERT). Supplementary information Supplementary data are available at Bioinformatics online.
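The abstract above does not describe how DNA is turned into model input. Transformer models of this kind are typically fed overlapping k-mer tokens, and the sketch below shows that preprocessing step in Python. The function name and the default `k=6` are assumptions for illustration and are not taken from the DNABERT repository.

```python
def kmer_tokens(seq, k=6):
    """Split a DNA sequence into overlapping k-mers with stride 1.

    Returns an empty list when the sequence is shorter than k.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


# A short hypothetical sequence tokenized with k = 3.
example_tokens = kmer_tokens("ATGGCT", k=3)
```

A sequence of length n produces n - k + 1 tokens, each sharing k - 1 bases with its neighbor, which gives every token some local context before the attention layers are applied.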


2021 ◽  
Vol 22 (S3) ◽  
Author(s):  
Junyi Li ◽  
Huinian Li ◽  
Xiao Ye ◽  
Li Zhang ◽  
Qingzhe Xu ◽  
...  

Abstract Background The prediction of long non-coding RNA (lncRNA) has attracted great attention from researchers, as more and more evidence indicates that various complex human diseases are closely related to lncRNAs. In the era of bio-med big data, in addition to the prediction of lncRNAs by biological experimental methods, many computational methods based on machine learning have been proposed to make better use of the sequence resources of lncRNAs. Results We developed a lncRNA prediction method by integrating information-entropy-based features and machine learning algorithms. We calculate generalized topological entropy and generate 6 novel features for lncRNA sequences. By employing these 6 features and other features such as open reading frame, we apply support vector machine, XGBoost and random forest algorithms to distinguish human lncRNAs. We compare our method with one that uses more K-mer features, and the results show that our method has a higher area under the curve, up to 99.7905%. Conclusions We develop an accurate and efficient method that uses novel information entropy features to analyze and classify lncRNAs. Our method is also extendable to research on other functional elements in DNA sequences.


2021 ◽  
Vol 11 (11) ◽  
pp. 4742
Author(s):  
Tianpei Xu ◽  
Ying Ma ◽  
Kangchul Kim

In recent years, the telecom market has been very competitive. The cost of retaining existing telecom customers is lower than that of attracting new customers. It is necessary for a telecom company to understand customer churn through customer relationship management (CRM). Therefore, CRM analyzers are required to predict which customers will churn. This study proposes a customer-churn prediction system that uses an ensemble-learning technique consisting of stacking models and soft voting. XGBoost, logistic regression, decision tree, and naïve Bayes machine-learning algorithms are selected to build a stacking model with two levels, and the three outputs of the second level are used for soft voting. Feature construction of the churn dataset includes equidistant grouping of customer behavior features to expand the feature space and discover latent information in the churn dataset. The original and new churn datasets are analyzed in the stacking ensemble model with four evaluation metrics. The experimental results show that the proposed customer churn predictions have accuracies of 96.12% and 98.09% for the original and new churn datasets, respectively. These results are better than those of state-of-the-art churn recognition systems.
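Two ideas from the abstract above, equidistant grouping of a numeric feature and soft voting over model outputs, can be illustrated with a small Python sketch. The function names, bin counts and probability values are hypothetical. This is a generic illustration of equal-width binning and probability averaging, not the authors' pipeline.

```python
def equal_width_bin(values, n_bins):
    """Assign each value to one of n_bins equal-width intervals over [min, max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)  # all values identical: a single bin
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def soft_vote(prob_lists):
    """Average per-class probabilities across models; return (winning class, averages)."""
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    avg = [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]
    winner = max(range(n_classes), key=avg.__getitem__)
    return winner, avg


# Hypothetical outputs of three second-level models for one customer: [P(stay), P(churn)].
label, averaged = soft_vote([[0.6, 0.4], [0.3, 0.7], [0.4, 0.6]])
```

Soft voting uses the confidence of each model rather than only its hard label, so one confident model can outweigh two marginal ones.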


IAWA Journal ◽  
2011 ◽  
Vol 32 (2) ◽  
pp. 221-232 ◽  
Author(s):  
Carolina Sarmiento ◽  
Pierre Détienne ◽  
Christine Heinz ◽  
Jean-François Molino ◽  
Pierre Grard ◽  
...  

Sustainable management and conservation of tropical trees and forests require accurate identification of tree species. Reliable, user-friendly identification tools based on macroscopic morphological features have already been developed for various tree floras. Wood anatomical features also provide a considerable amount of information that can be used for timber traceability, certification and trade control. Yet, this information is still poorly used, and only a handful of experts are able to use it for plant species identification. Here, we present an interactive, user-friendly tool based on vector graphics, illustrating 99 states of 27 wood characters from 110 Amazonian tree species belonging to 34 families. Pl@ntWood is a graphical identification tool based on the IDAO system, a multimedia approach to plant identification. Wood anatomical characters were selected from the IAWA list of microscopic features for hardwood identification, which will enable us to easily extend this work to a larger number of species. A stand-alone application has been developed and an on-line version will be delivered in the near future. Besides allowing non-specialists to identify plants in a user-friendly interface, this system can be used for different purposes such as teaching, conservation, management, and self-training in the wood anatomy of tropical species.


Parasitology ◽  
1999 ◽  
Vol 119 (3) ◽  
pp. 315-321 ◽  
Author(s):  
A. IMASE ◽  
T. KUMAGAI ◽  
H. OHMAE ◽  
Y. IRIE ◽  
Y. IWAMURA

Localization of the type 2 Alu sequence (B2), a highly repetitive DNA sequence in the mouse genome, was examined by in situ polymerase chain reaction (in situ PCR) in schistosomes. The signals to the B2 sequence were detected in the cytoplasm of the tegumental membrane and in the nuclei of the mesenchymal, testicular, ovarian and vitelline cells of 8-week Schistosoma japonicum. In contrast, it was difficult to detect any signals of this sequence in 8-week S. mansoni, whereas in 24-week male S. mansoni the signals were observed in the cytoplasm of the tegumental tubercles and in the nuclei of the mesenchymal and testicular cells. On the other hand, in 24-week female S. mansoni the signals were found in the nuclei of the mesenchymal, ovarian and vitelline cells but not found in the tegument. On the contrary, no hybridization band of the B2 sequence was detected in the amplified DNA of 3-week schistosomula of either species. These observations proved that the host DNA sequences existed in restricted schistosome cells and were accumulated in the schistosome body during their development.

