A deep learning framework combined with word embedding to identify DNA replication origins

2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Feng Wu ◽  
Runtao Yang ◽  
Chengjin Zhang ◽  
Lina Zhang

Abstract DNA replication influences the inheritance of genetic information in the DNA life cycle. As the distribution of replication origins (ORIs) is the major determinant in precisely regulating the replication process, the correct identification of ORIs is important for understanding DNA replication mechanisms and the regulatory mechanisms of gene expression. In eukaryotes in particular, multiple ORIs exist in each gene sequence so that replication completes in a reasonable period of time. To simplify the identification of eukaryotic ORIs, most existing methods are built on traditional machine learning algorithms and target gene sequences of a fixed length. Consequently, the identification results are not satisfactory, and there is still considerable room for improvement. To overcome the limitations of previous studies, this paper develops sequence segmentation methods and employs the word embedding technique ‘Word2vec’ to convert gene sequences into word vectors, thereby capturing the inner correlations of gene sequences of different lengths. A deep learning framework for the ORI identification task is then constructed from a convolutional neural network with an embedding layer. An analysis of the dimensionality-reduced similarity diagram shows that Word2vec can effectively transform the inner relationships among words into numerical features. For the four species in this study, the best models achieve overall accuracies of 0.975, 0.765, 0.885 and 0.967, Matthews correlation coefficients of 0.940, 0.530, 0.771 and 0.934, and AUCs of 0.975, 0.800, 0.888 and 0.981, which indicates that the proposed predictor is stable and classifies both ORIs and non-ORIs with high confidence. Compared with state-of-the-art methods, the proposed predictor achieves significantly improved ORI identification. It is therefore reasonable to anticipate that the proposed method will be a useful high-throughput tool for genome analysis.
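
The abstract above does not spell out its segmentation rules, so the following is only a minimal sketch of how a DNA sequence might be cut into "words" and turned into integer indices for an embedding layer. The function names, the word length k and the choice of overlapping versus non-overlapping windows are all assumptions made for illustration, not the authors' published procedure.

```python
def segment_overlapping(seq, k):
    """Slide a window of width k along the sequence one base at a time."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def segment_non_overlapping(seq, k):
    """Cut the sequence into consecutive blocks of width k; a short tail is dropped."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]


def build_vocabulary(sentences):
    """Assign each distinct word an integer index, keeping 0 free for padding."""
    vocab = {}
    for words in sentences:
        for word in words:
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab


def encode(words, vocab, max_len):
    """Map words to indices, then truncate or zero-pad to a fixed length."""
    ids = [vocab[word] for word in words][:max_len]
    return ids + [0] * (max_len - len(ids))


if __name__ == "__main__":
    words = segment_overlapping("ATGCGT", 3)
    vocab = build_vocabulary([words])
    print(words)
    print(encode(words, vocab, 6))
```

Padding to a fixed length is one common way to feed sequences of different lengths to a convolutional network; whether the paper handles variable length this way is not stated in the abstract.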

Electronics ◽  
2021 ◽  
Vol 10 (10) ◽  
pp. 1136
Author(s):  
David Augusto Ribeiro ◽  
Juan Casavílca Silva ◽  
Renata Lopes Rosa ◽  
Muhammad Saadi ◽  
Shahid Mumtaz ◽  
...  

Light field (LF) imaging has multi-view properties that enable many applications, including auto-refocusing, depth estimation and 3D reconstruction of images, which are required particularly for intelligent transportation systems (ITSs). However, cameras can present a limited angular resolution, which becomes a bottleneck in vision applications. Thus, incorporating angular data is challenging because of disparities in the LF images. In recent years, different machine learning algorithms have been applied to both image processing and ITS research areas for different purposes. In this work, a Lightweight Deformable Deep Learning Framework is implemented, in which the problem of disparity in LF images is addressed. To this end, an angular alignment module and a soft activation function are incorporated into the Convolutional Neural Network (CNN). For performance assessment, the proposed solution is compared with recent state-of-the-art methods using different LF datasets, each one with specific characteristics. Experimental results demonstrated that the proposed solution achieved a better performance than the other methods. The image quality results obtained outperform state-of-the-art LF image reconstruction methods. Furthermore, our model presents a lower computational complexity, decreasing the execution time.


2019 ◽  
Vol 36 (1) ◽  
pp. 49-55 ◽  
Author(s):  
Chenwei Lou ◽  
Jian Zhao ◽  
Ruoyao Shi ◽  
Qian Wang ◽  
Wenyang Zhou ◽  
...  

Abstract Motivation: Cell divisions start from replicating the double-stranded DNA, and the DNA replication process needs to be precisely regulated both spatially and temporally. The DNA is replicated starting from the DNA replication origins. A few successful prediction models were generated based on the assumption that DNA replication origin regions have sequence-level features, such as physicochemical properties, that differ significantly from those of other DNA regions. Results: This study proposed a feature selection procedure to further refine the classification model of the DNA replication origins. The experimental data demonstrated that an improvement of as much as 26% in prediction accuracy may be achieved on the yeast Saccharomyces cerevisiae. Moreover, the prediction accuracies of the DNA replication origins were improved for all four yeast genomes investigated in this study. Availability and implementation: The software sefOri version 1.0 was available at http://www.healthinformaticslab.org/supp/resources.php. An online server was also provided for the convenience of the users, and its web link may be found on the above-mentioned web page. Supplementary information: Supplementary data are available at Bioinformatics online.


2021 ◽  
Vol 9 (2) ◽  
pp. 169
Author(s):  
Igor Ryazanov ◽  
Amanda T. Nylund ◽  
Debabrota Basu ◽  
Ida-Maja Hassellöv ◽  
Alexander Schliep

Driven by the unprecedented availability of data, machine learning has become a pervasive and transformative technology across industry and science. Its importance to marine science has been codified as one goal of the UN Ocean Decade. While increasing amounts of, for example, acoustic marine data are collected for research and monitoring purposes, and machine learning methods can achieve automatic processing and analysis of acoustic data, they require large training datasets annotated or labelled by experts. Consequently, addressing the relative scarcity of labelled data is, besides increasing data analysis and processing capacities, one of the main thrust areas. One approach to address label scarcity is the expert-in-the-loop approach which allows analysis of limited and unbalanced data efficiently. Its advantages are demonstrated with our novel deep learning-based expert-in-the-loop framework for automatic detection of turbulent wake signatures in echo sounder data. Using machine learning algorithms, such as the one presented in this study, greatly increases the capacity to analyse large amounts of acoustic data. It would be a first step in realising the full potential of the increasing amount of acoustic data in marine sciences.


2021 ◽  
pp. 1063293X2199495
Author(s):  
Eddy Sánchez-DelaCruz ◽  
Juan P Salazar López ◽  
David Lara Alabazares ◽  
Edgar Tello Leal ◽  
Mirta Fuentes-Ramos

Foliar disease is a common problem in plants; it appears as an abnormal change in the plant’s characteristics, such as the presence of lesions and discolorations, among others. These problems may affect plant growth, which causes a decrease in crop production and impacts the agricultural economy. The causes of leaf damage vary and include bacteria, viruses, nutritional deficiencies, or even consequences of climate change. Motivated to find a solution for this problem, we propose that, using image processing and machine learning algorithms (MLA), these symptomatic characteristics of the leaf can be used to classify diseases. The contributions of this research are (i) the use of image processing methods in feature extraction (characteristics), and (ii) the combination of assembled algorithms with deep learning to classify foliar features of Valencia orange (Citrus sinensis) tree leaves. Combining these two classification approaches, we obtain optimal rates on binary datasets and highly competitive percentages on multiclass sets, using a database of images of three types of foliar damage in local plants. The combination of these two classification strategies is an exceptionally reliable alternative for leaf damage identification in orange and other citrus plants.


2021 ◽  
Author(s):  
Lei Deng ◽  
Wenjuan Nie ◽  
Jiaojiao Zhao ◽  
Jingpu Zhang

Abstract Background: Viral infections and diseases are caused by various viruses through the protein-protein interaction (PPI) between virus and host, and they are a threat to human health. Studying the virus-host PPI helps in understanding the mechanism of viral infection and in developing new treatment drugs. Although several computational methods for predicting the virus-host PPI have been proposed, most of them rely on machine learning algorithms, which makes hidden high-level features difficult to extract. Results: We proposed a novel hybrid deep learning framework that combines four CNN layers with an LSTM to predict the virus-host PPI using only protein sequence information. The CNN can extract the nonlinear position-related features of the protein sequence, and the LSTM can capture long-term relevant information. L1-regularized logistic regression is applied to eliminate noise and redundant information. Our model achieved the best performance on the benchmark dataset and the independent set compared with other existing methods. Conclusion: Our method, through the hybrid deep neural network, is useful for predicting virus-host PPI using protein sequence alone, and achieved the best prediction performance compared with other existing methods, which is promising for virus-host PPI prediction.


Author(s):  
Haodong Xu ◽  
Peilin Jia ◽  
Zhongming Zhao

Abstract DNA N4-methylcytosine (4mC) modification represents a novel epigenetic regulation. It is involved in various cellular processes, including DNA replication, cell cycle and gene expression, among others. In addition to experimental identification of 4mC sites, in silico prediction of 4mC sites in the genome has emerged as an alternative and promising approach. In this study, we first reviewed the current progress in the computational prediction of 4mC sites and systematically evaluated the predictive capacity of eight conventional machine learning algorithms as well as 12 feature types commonly used in previous studies in six species. Using a representative benchmark dataset, we investigated the contribution of feature selection and a stacking approach to model construction, and found that feature optimization and proper reinforcement learning could improve the performance. We next collected newly added 4mC sites in the six species’ genomes and developed a novel deep learning-based 4mC site predictor, namely Deep4mC. Deep4mC applies convolutional neural networks with four representative features. For species with small numbers of samples, we extended our deep learning framework with a bootstrapping method. Our evaluation indicated that Deep4mC could obtain high accuracy and robust performance, with average area under curve (AUC) values greater than 0.9 in all species (range: 0.9005–0.9722). Compared with previous tools, Deep4mC improved the AUC value by 10.14% to 46.21% in these six species. A user-friendly web server (https://bioinfo.uth.edu/Deep4mC) was built for predicting putative 4mC sites in a genome.


2021 ◽  
Vol 22 (7) ◽  
pp. 3589
Author(s):  
Runtao Yang ◽  
Feng Wu ◽  
Chengjin Zhang ◽  
Lina Zhang

As critical components of DNA, enhancers can efficiently and specifically manipulate the spatial and temporal regulation of gene transcription. Malfunction or dysregulation of enhancers is implicated in a range of human pathologies. Therefore, identifying enhancers and their strength may provide insights into the molecular mechanisms of gene transcription and facilitate the discovery of candidate drug targets. In this paper, a new predictor of enhancers and their strength, iEnhancer-GAN, is proposed based on a deep learning framework in combination with word embedding and a sequence generative adversarial net (Seq-GAN). Considering the relatively small training dataset, the Seq-GAN is designed to generate artificial sequences. Given that each functional element in DNA sequences is analogous to a “word” in linguistics, word segmentation methods are proposed to divide DNA sequences into “words”, and the skip-gram model is employed to transform the “words” into digital vectors. In view of its powerful ability to extract high-level abstract features, a convolutional neural network (CNN) architecture is constructed to perform the identification tasks, and the word vectors of DNA sequences are vertically concatenated to form the embedding matrices used as the input of the CNN. Experimental results demonstrate the effectiveness of the Seq-GAN in expanding the training dataset, the possibility of applying word segmentation methods to extract “words” from DNA sequences, the feasibility of using the skip-gram model to encode DNA sequences, and the powerful prediction ability of the CNN. Compared with other state-of-the-art methods on the training dataset and the independent test dataset, the proposed method achieves a significantly improved overall performance. It is anticipated that the proposed method will help advance enhancer-related fields.
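
To make the skip-gram step more concrete, here is a small sketch of how (center, context) training pairs can be drawn from a list of DNA "words", and how word vectors can be stacked row by row into an embedding matrix. The window size, the function names and the toy vectors are assumptions for illustration; the abstract gives neither the segmentation rule nor the training settings used for iEnhancer-GAN.

```python
def skipgram_pairs(words, window):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(words):
        lo = max(0, i - window)
        hi = min(len(words), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, words[j]))
    return pairs


def embedding_matrix(words, vectors):
    """Stack the vector of each word as one row, in sequence order."""
    return [vectors[word] for word in words]


if __name__ == "__main__":
    words = ["ACG", "CGT", "GTA"]
    # Toy two-dimensional vectors; a trained skip-gram model would supply these.
    vectors = {"ACG": [0.1, 0.2], "CGT": [0.3, 0.4], "GTA": [0.5, 0.6]}
    print(skipgram_pairs(words, window=1))
    print(embedding_matrix(words, vectors))
```

A skip-gram model is trained to predict the context word from the center word in each pair; the resulting matrix has one row per "word", which matches the vertical concatenation described above.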


Kybernetes ◽  
2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Shubham Bharti ◽  
Arun Kumar Yadav ◽  
Mohit Kumar ◽  
Divakar Yadav

Purpose: With the rise of social media platforms, an increasing number of cyberbullying cases has emerged. Every day, a large number of people, especially teenagers, become victims of cyber abuse. Cyberbullying can have a long-lasting impact on the victim’s mind. As a result, the victim may develop social anxiety, engage in self-harm, become depressed or, in extreme cases, be driven to suicide. This paper aims to evaluate various techniques to automatically detect cyberbullying from tweets by using machine learning and deep learning approaches. Design/methodology/approach: The authors applied machine learning algorithms and, after analyzing the experimental results, postulated that deep learning algorithms perform better for the task. Word-embedding techniques were used for word representation in model training. The pre-trained embedding GloVe was used to generate word embeddings. Different versions of GloVe were used and their performance was compared. Bi-directional long short-term memory (BLSTM) was used for classification. Findings: The dataset contains 35,787 labeled tweets. The GloVe840 word embedding technique along with BLSTM provided the best results on the dataset, with an accuracy, precision and F1 measure of 92.60%, 96.60% and 94.20%, respectively. Research limitations/implications: If a word is not present in the pre-trained embedding (GloVe), it may be given a random vector representation that may not correspond to the actual meaning of the word. This means that an out-of-vocabulary (OOV) word may not be represented suitably, which can affect the detection of cyberbullying tweets. The problem may be rectified through the use of character-level embedding of words. Practical implications: The findings of the work may inspire entrepreneurs to leverage the proposed approach to build deployable systems to detect cyberbullying in different contexts such as workplace, school, etc., and may also draw the attention of lawmakers and policymakers to create systemic tools to tackle the ills of cyberbullying. Social implications: Cyberbullying, if effectively detected, may save the victims from various psychological problems, which, in turn, may lead society to a healthier and more productive life. Originality/value: The proposed method produced results that outperform the state-of-the-art approaches in detecting cyberbullying from tweets. It uses a large dataset, created by intelligently merging two publicly available datasets. Further, a comprehensive evaluation of the proposed methodology has been presented.
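
The out-of-vocabulary limitation described above can be illustrated with a short sketch. The dictionary standing in for the pre-trained GloVe table, the function names and the uniform range for random vectors are all hypothetical; this is not the authors' code, only one plausible reading of "a random vector representation".

```python
import random


def lookup(word, pretrained, dim, rng):
    """Return (vector, is_oov). Unknown words get a random vector of size `dim`."""
    if word in pretrained:
        return pretrained[word], False
    # Assumed fallback: small uniform noise that carries no meaning for the word.
    return [rng.uniform(-0.25, 0.25) for _ in range(dim)], True


def oov_rate(tokens, pretrained):
    """Fraction of tokens that are missing from the pre-trained vocabulary."""
    if not tokens:
        return 0.0
    missing = sum(1 for token in tokens if token not in pretrained)
    return missing / len(tokens)


if __name__ == "__main__":
    pretrained = {"hello": [1.0, 2.0], "world": [0.5, 0.5]}  # toy stand-in table
    rng = random.Random(0)
    for token in ["hello", "w0rld"]:
        vector, is_oov = lookup(token, pretrained, dim=2, rng=rng)
        print(token, is_oov, vector)
    print(oov_rate(["hello", "w0rld", "world", "h3llo"], pretrained))
```

Measuring the OOV rate on a tweet corpus before training gives a rough sense of how much of the input falls back to such meaningless vectors, which is where character-level embeddings would be expected to help.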


2020 ◽  
Author(s):  
Jordan Anaya ◽  
John-William Sidhom ◽  
Craig A. Cummings ◽  
Alexander S. Baras ◽  

Abstract Deep learning has the ability to extract meaningful features from data given enough training examples. Large scale genomic data are well suited for this class of machine learning algorithms; however, for many of these data the labels are at the level of the sample instead of at the level of the individual genomic measures. To leverage the power of deep learning for these types of data we turn to a multiple instance learning framework, and present an easily extensible tool built with TensorFlow and Keras. We show how this tool can be applied to somatic variants (featurizing genomic position and sequence context), and accurately classify samples according to whether they contain a specific variant (hotspot or tumor suppressor) or whether they contain a type of variant (microsatellite instability). We then apply our model to the calibration of tumor mutational burden (TMB), an increasingly important metric in the field of immunotherapy, across a variety of commonly used gene panels. Regardless of the panel, we observed improvements in regression to the gold standard whole exome derived value for this metric, with additional performance benefits as more data were provided to the model (such as noncoding variants from panel assays). Our results suggest this framework could lead to improvements in a range of tasks where the sample level metric is determined by the aggregation of a set of genomic measures, such as somatic mutations that we focused on in this study.
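
The core idea of multiple instance learning, where a label belongs to the whole sample rather than to each genomic measure, can be sketched as an aggregation step over per-instance feature vectors. The pooling functions below are generic illustrations in plain Python and are assumptions for this example; the abstract does not say which aggregation the authors' TensorFlow and Keras tool actually uses.

```python
def mean_pool(instances):
    """Average the instance vectors of one sample into a single sample-level vector."""
    n = len(instances)
    dim = len(instances[0])
    return [sum(vec[d] for vec in instances) / n for d in range(dim)]


def max_pool(instances):
    """Keep the largest value per feature, so one strong instance can decide the sample."""
    dim = len(instances[0])
    return [max(vec[d] for vec in instances) for d in range(dim)]


if __name__ == "__main__":
    # One sample (a "bag") holding two variants, each described by two toy features.
    sample = [[1.0, 2.0], [3.0, 4.0]]
    print(mean_pool(sample))
    print(max_pool(sample))
```

Max pooling suits questions such as "does this sample contain a specific variant", since a single instance is enough, while mean or sum pooling is closer to burden-style metrics that depend on the whole set.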


2020 ◽  
Author(s):  
Azam Mohsin ◽  
Stephen Arnovitz ◽  
Aly A Khan ◽  
Fotini Gounari

Abstract All life forms undergo cell division and are dependent on faithful DNA replication to maintain the stability of their genomes. Both intrinsic and extrinsic factors can stress the replication process and multiple checkpoint mechanisms have evolved to ensure genome stability. Understanding these molecular mechanisms is crucial for preventing and treating genomic instability associated diseases including cancer. DNA replicating fiber fluorography is a powerful technique that directly visualizes the replication process and a cell’s response to replication stress. Analysis of DNA-fiber microscopy images provides quantitative information about replication fitness. However, a bottleneck for high throughput DNA-fiber studies is that quantitative measurements are laborious when performed manually. Here we introduce FiberAI, which uses state-of-the-art deep learning frameworks to detect and quantify DNA-fibers in high throughput microscopy images. FiberAI efficiently detects DNA fibers, achieving a bounding box average precision score of 0.91 and a segmentation average precision score of 0.90. We then use FiberAI to measure the integrity of replication checkpoints. FiberAI is publicly available and allows users to view model predicted selections, add their own manual selections, and easily analyze multiple image sets. Thus, FiberAI can help elucidate DNA replication processes by streamlining DNA-fiber analyses.

