scholarly journals Classifying Promoters by Interpreting the Hidden Information of DNA Sequences via Deep Learning and Combination of Continuous FastText N-Grams

Author(s):  
Nguyen Quoc Khanh Le ◽  
Edward Kien Yee Yapp ◽  
N. Nagasundaram ◽  
Hui-Yuan Yeh
2020 ◽  
Vol 26 ◽  
Author(s):  
Xiaoping Min ◽  
Fengqing Lu ◽  
Chunyan Li

: Enhancer-promoter interactions (EPIs) in the human genome are of great significance to transcriptional regulation which tightly controls gene expression. Identification of EPIs can help us better deciphering gene regulation and understanding disease mechanisms. However, experimental methods to identify EPIs are constrained by the fund, time and manpower while computational methods using DNA sequences and genomic features are viable alternatives. Deep learning methods have shown promising prospects in classification and efforts that have been utilized to identify EPIs. In this survey, we specifically focus on sequence-based deep learning methods and conduct a comprehensive review of the literatures of them. We first briefly introduce existing sequence-based frameworks on EPIs prediction and their technique details. After that, we elaborate on the dataset, pre-processing means and evaluation strategies. Finally, we discuss the challenges these methods are confronted with and suggest several future opportunities.


2020 ◽  
Author(s):  
Yupeng Wang ◽  
Rosario B. Jaime-Lara ◽  
Abhrarup Roy ◽  
Ying Sun ◽  
Xinyue Liu ◽  
...  

AbstractWe propose SeqEnhDL, a deep learning framework for classifying cell type-specific enhancers based on sequence features. DNA sequences of “strong enhancer” chromatin states in nine cell types from the ENCODE project were retrieved to build and test enhancer classifiers. For any DNA sequence, sequential k-mer (k=5, 7, 9 and 11) fold changes relative to randomly selected non-coding sequences were used as features for deep learning models. Three deep learning models were implemented, including multi-layer perceptron (MLP), Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN). All models in SeqEnhDL outperform state-of-the-art enhancer classifiers including gkm-SVM and DanQ, with regard to distinguishing cell type-specific enhancers from randomly selected non-coding sequences. Moreover, SeqEnhDL is able to directly discriminate enhancers from different cell types, which has not been achieved by other enhancer classifiers. Our analysis suggests that both enhancers and their tissue-specificity can be accurately identified according to their sequence features. SeqEnhDL is publicly available at https://github.com/wyp1125/SeqEnhDL.


2019 ◽  
Author(s):  
Ardi Tampuu ◽  
Zurab Bzhalava ◽  
Joakim Dillner ◽  
Raul Vicente

ABSTRACTDespite its clinical importance, detection of highly divergent or yet unknown viruses is a major challenge. When human samples are sequenced, conventional alignments classify many assembled contigs as “unknown” since many of the sequences are not similar to known genomes. In this work, we developed ViraMiner, a deep learning-based method to identify viruses in various human biospecimens. ViraMiner contains two branches of Convolutional Neural Networks designed to detect both patterns and pattern-frequencies on raw metagenomics contigs. The training dataset included sequences obtained from 19 metagenomic experiments which were analyzed and labeled by BLAST. The model achieves significantly improved accuracy compared to other machine learning methods for viral genome classification. Using 300 bp contigs ViraMiner achieves 0.923 area under the ROC curve. To our knowledge, this is the first machine learning methodology that can detect the presence of viral sequences among raw metagenomic contigs from diverse human samples. We suggest that the proposed model captures different types of information of genome composition, and can be used as a recommendation system to further investigate sequences labeled as “unknown” by conventional alignment methods. Exploring these highly-divergent viruses, in turn, can enhance our knowledge of infectious causes of diseases.


2020 ◽  
Author(s):  
Shabir Moosa ◽  
Abbes Amira ◽  
Sabri Boughorbel

Abstract Background: The data explosion caused by unprecedented advancements in the field of genomics is constantly challenging the conventional methods used in the interpretation of the human genome. The demand for robust algorithms over the recent years has brought huge success in the field of Deep Learning (DL) in solving many difficult tasks in image, speech and natural language processing by automating the manual process of architecture design. This has been fueled through the development of new DL architectures. Yet genomics possesses unique challenges as we expect DL to provide a super human intelligence that easily interprets a human genome.Methods: We adapted a differential architecture search method for the interpretation of biological sequences and applied it to the splice site recognition task on DNA sequences to discover new high-performance convolutional architectures in an automated manner. The discovered architecture was benchmarked on CPU and multiple GPU architectures in terms of computational time and classification performance.Results: Our experimental evaluation demonstrated that the discovered architecture outperformed fixed baseline architectures for classification of splice sites. The benchmarking experiments of execution time and precision on architecture search and evaluation process showed that they performed better on recently available GPU models.Conclusions: We applied differential architecture search mechanism to perform splice site classification on raw DNA sequences, and discovered new models with better performance than major baseline models. The results have shown a potential of using this automated architecture search mechanism for solving various problems in genomics domain.


Symmetry ◽  
2021 ◽  
Vol 13 (9) ◽  
pp. 1599
Author(s):  
Lina Jin ◽  
Jiong Yu ◽  
Xiaoqian Yuan ◽  
Xusheng Du

Fish is one of the most extensive distributed organisms in the world. Fish taxonomy is an important component of biodiversity and the basis of fishery resources management. The DNA barcode based on a short sequence fragment is a valuable molecular tool for fish classification. However, the high dimensionality of DNA barcode sequences and the limitation of the number of fish species make it difficult to reasonably analyze the DNA sequences and correctly classify fish from different families. In this paper, we propose a novel deep learning method that fuses Elastic Net-Stacked Autoencoder (EN-SAE) with Kernel Density Estimation (KDE), named ESK model. In stage one, the ESK preprocesses original data from DNA barcode sequences. In stage two, EN-SAE is used to learn the deep features and obtain the outgroup score of each fish. In stage three, KDE is used to select a threshold based on the outgroup scores and classify fish from different families. The effectiveness and superiority of ESK have been validated by experiments on three datasets, with the accuracy, recall, F1-Score reaching 97.57%, 97.43%, and 98.96% on average. Those findings confirm that ESK can accurately classify fish from different families based on DNA barcode sequences.


2020 ◽  
Vol 73 (4) ◽  
pp. 275-284
Author(s):  
Dukyong Yoon ◽  
Jong-Hwan Jang ◽  
Byung Jin Choi ◽  
Tae Young Kim ◽  
Chang Ho Han

Biosignals such as electrocardiogram or photoplethysmogram are widely used for determining and monitoring the medical condition of patients. It was recently discovered that more information could be gathered from biosignals by applying artificial intelligence (AI). At present, one of the most impactful advancements in AI is deep learning. Deep learning-based models can extract important features from raw data without feature engineering by humans, provided the amount of data is sufficient. This AI-enabled feature presents opportunities to obtain latent information that may be used as a digital biomarker for detecting or predicting a clinical outcome or event without further invasive evaluation. However, the black box model of deep learning is difficult to understand for clinicians familiar with a conventional method of analysis of biosignals. A basic knowledge of AI and machine learning is required for the clinicians to properly interpret the extracted information and to adopt it in clinical practice. This review covers the basics of AI and machine learning, and the feasibility of their application to real-life situations by clinicians in the near future.


2018 ◽  
Author(s):  
Reem Elsousy ◽  
Nagarajan Kathiresan ◽  
Sabri Boughorbel

AbstractThe success of deep learning has been shown in various fields including computer vision, speech recognition, natural language processing and bioinformatics. The advance of Deep Learning in Computer Vision has been an important source of inspiration for other research fields. The objective of this work is to adapt known deep learning models borrowed from computer vision such as VGGNet, Resnet and AlexNet for the classification of biological sequences. In particular, we are interested by the task of splice site identification based on raw DNA sequences. We focus on the role of model architecture depth on model training and classification performance.We show that deep learning models outperform traditional classification methods (SVM, Random Forests, and Logistic Regression) for large training sets of raw DNA sequences. Three model families are analyzed in this work namely VGGNet, AlexNet and ResNet. Three depth levels are defined for each model family. The models are benchmarked using the following metrics: Area Under ROC curve (AUC), Number of model parameters, number of floating operations. Our extensive experimental evaluation show that shallow architectures have an overall better performance than deep models. We introduced a shallow version of ResNet, named S-ResNet. We show that it gives a good trade-off between model complexity and classification performance.Author summaryDeep Learning has been widely applied to various fields in research and industry. It has been also succesfully applied to genomics and in particular to splice site identification. We are interested in the use of advanced neural networks borrowed from computer vision. We explored well-known models and their usability for the problem of splice site identification from raw sequences. Our extensive experimental analysis shows that shallow models outperform deep models. We introduce a new model called S-ResNet, which gives a good trade-off between computational complexity and classification accuracy.


2021 ◽  
Author(s):  
Li Ye ◽  
Chunquan Li ◽  
Jiquan Ma

The identification of enhancers has always been an important task in bioinformatics owing to their major role in regulating gene expression. For this reason, many computational algorithms devoted to enhancer identification have been put forward over the years, ranging from statistics and machine learning to the increasing popular deep learning. To boost the performance of their methods, more features tend to be extracted from the single DNA sequences and integrated to develop an ensemble classifier. Nevertheless, the sequence-derived features used in previous studies can hardly provide the 3D structure information of DNA sequences, which is regarded as an important factor affecting the binding preferences of transcription factors to regulatory elements like enhancers. Given that, we here propose DENIES, a deep learning based two-layer predictor for enhancing the identification of enhancers and their strength. Besides two common sequence-derived features (i.e. one-hot and k-mer), it introduces DNA shape for describing the 3D structures of DNA sequences. The results of performance comparison with a series of state-of-the-art methods conducted on the same datasets prove the effectiveness and robustness of our method. The code implementation of our predictor is freely available at https://github.com/hlju-liye/DENIES.


2021 ◽  
Author(s):  
Pablo Millan Arias ◽  
Fatemeh Alipour ◽  
Kathleen Hill ◽  
Lila Kari

We present a novel Deep Learning method for the Unsupervised Classification of DNA Sequences (DeLUCS) that does not require sequence alignment, sequence homology, or (taxonomic) identifiers. DeLUCS uses Chaos Game Representations (CGRs) of primary DNA sequences, and generates “mimic” sequence CGRs to self-learn data patterns (genomic signatures) through the optimization of multiple neural networks. A majority voting scheme is then used to determine the final cluster label for each sequence. DeLUCS is able to cluster large and diverse datasets, with accuracies ranging from 77% to 100%: 2,500 complete vertebrate mitochondrial genomes, at taxonomic levels from sub-phylum to genera; 3,200 randomly selected 400 kbp-long bacterial genome segments, into families; three viral genome and gene datasets, averaging 1,300 sequences each, into virus subtypes. DeLUCS significantly outperforms two classic clustering methods (K-means and Gaussian Mixture Models) for unlabelled data, by as much as 48%. DeLUCS is highly effective, it is able to classify datasets of unlabelled primary DNA sequences totalling over 1 billion bp of data, and it bypasses common limitations to classification resulting from the lack of sequence homology, variation in sequence length, and the absence or instability of sequence annotations and taxonomic identifiers. Thus, DeLUCS offers fast and accurate DNA sequence classification for previously unclassifiable datasets.


Sign in / Sign up

Export Citation Format

Share Document