DeepHINT: Understanding HIV-1 integration via deep learning with attention

2018 ◽  
Author(s):  
Hailin Hu ◽  
An Xiao ◽  
Sai Zhang ◽  
Yangyang Li ◽  
Xuanling Shi ◽  
...  

Abstract Motivation: Human immunodeficiency virus type 1 (HIV-1) genome integration is closely related to clinical latency and viral rebound. In addition to human DNA sequences that directly interact with the integration machinery, the selection of HIV integration sites has also been shown to depend on the heterogeneous genomic context around a large region, which greatly hinders the prediction and mechanistic studies of HIV integration. Results: We have developed an attention-based deep learning framework, named DeepHINT, to simultaneously provide accurate prediction of HIV integration sites and mechanistic explanations of the detected sites. Extensive tests on a high-density HIV integration site dataset showed that DeepHINT can outperform conventional modeling strategies by automatically learning the genomic context of HIV integration solely from primary DNA sequence information. Systematic analyses on diverse known factors of HIV integration further validated the biological relevance of the prediction result. More importantly, in-depth analyses of the attention values output by DeepHINT revealed intriguing mechanistic implications in the selection of HIV integration sites, including potential roles of several basic helix-loop-helix (bHLH) transcription factors and zinc-finger proteins. These results established DeepHINT as an effective and explainable deep learning framework for the prediction and mechanistic study of HIV integration. Availability: DeepHINT is available as open-source software and can be downloaded from https://github.com/nonnerdling/[email protected]@tsinghua.edu.cn
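The abstract describes attention values as the source of DeepHINT's mechanistic explanations. A minimal sketch of that general idea follows, assuming a simple dot-product attention over one-hot encoded bases. The function names, the `query` vector and its values are hypothetical illustrations and are not taken from the DeepHINT code base.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(features, query):
    """Score each position against a query vector, normalize the scores
    into attention weights, and return (weights, weighted-sum vector).
    `query` stands in for a learned parameter (assumption)."""
    scores = [sum(f_i * q_i for f_i, q_i in zip(f, query)) for f in features]
    weights = softmax(scores)
    dim = len(features[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return weights, pooled

# One-hot encoding of the four DNA bases.
ONE_HOT = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0], "G": [0, 0, 1, 0], "T": [0, 0, 0, 1]}

sequence = "ACGTGCA"                  # toy sequence, not real data
features = [ONE_HOT[b] for b in sequence]
query = [0.0, 0.5, 2.0, 0.0]          # made-up values that favor G

weights, pooled = attention_pool(features, query)
# Position with the largest attention weight (first one on ties).
top = max(range(len(sequence)), key=lambda i: weights[i])
```

Inspecting `weights` position by position is the kind of analysis the abstract alludes to: positions with high attention are candidates for sequence features that drive the prediction.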

2019 ◽  
Vol 20 (5) ◽  
pp. 1070 ◽  
Author(s):  
Cheng Peng ◽  
Siyu Han ◽  
Hui Zhang ◽  
Ying Li

Non-coding RNAs (ncRNAs) play crucial roles in multiple fundamental biological processes, such as post-transcriptional gene regulation, and are implicated in many complex human diseases. Most ncRNAs function by interacting with corresponding RNA-binding proteins. Research on ncRNA–protein interaction is the key to understanding the function of ncRNA. However, the experimental techniques for identifying RNA–protein interactions (RPIs) are still expensive and time-consuming. Due to the complex molecular mechanism of ncRNA–protein interaction and the lack of conservation for ncRNA, especially for long ncRNA (lncRNA), the prediction of ncRNA–protein interaction is still a challenge. Deep learning-based models have become the state of the art in a range of biological sequence analysis problems due to their strong feature-learning power. In this study, we proposed a hierarchical deep learning framework, RPITER, to predict RNA–protein interaction. For sequence coding, we improved the conjoint triad feature (CTF) coding method by complementing more primary sequence information and adding sequence structure information. For model design, RPITER employed two basic neural network architectures: the convolutional neural network (CNN) and the stacked auto-encoder (SAE). Comprehensive experiments were performed on five benchmark datasets from the PDB and NPInter databases to analyze and compare the performances of different sequence coding methods and prediction models. We found that the CNN and SAE deep learning architectures have powerful fitting abilities for the k-mer features of RNA and protein sequences. The improved CTF coding method showed a performance gain compared with the original CTF method. Moreover, RPITER performed well in predicting RNA–protein interaction (RPI) and could outperform most of the previous methods. On five widely used RPI datasets, RPI369, RPI488, RPI1807, RPI2241 and NPInter, RPITER obtained AUC values of 0.821, 0.911, 0.990, 0.957 and 0.985, respectively. The proposed RPITER could be a complementary method for predicting RPI and constructing RPI networks, which would help push forward the related biological research on ncRNAs and lncRNAs.
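To make the conjoint triad feature (CTF) idea concrete, here is a minimal sketch of the original-style protein CTF only, not the improved coding proposed in the abstract. It assumes the commonly used seven-class amino acid grouping and a simple frequency normalization. The grouping, the normalization and the function name are assumptions and may differ from what RPITER implements.

```python
from itertools import product

# Assumed seven-class amino acid grouping often used with conjoint triads.
GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_GROUP = {aa: str(i) for i, group in enumerate(GROUPS) for aa in group}

def conjoint_triad(seq):
    """Return a 343-dimensional (7^3) vector of triad frequencies.

    Each residue is mapped to its group label, then every window of three
    consecutive labels is counted. Residues outside the 20 standard amino
    acids are dropped before windowing (a simplification)."""
    triads = ["".join(t) for t in product("0123456", repeat=3)]
    counts = dict.fromkeys(triads, 0)
    reduced = [AA_TO_GROUP[aa] for aa in seq if aa in AA_TO_GROUP]
    for i in range(len(reduced) - 2):
        counts["".join(reduced[i:i + 3])] += 1
    total = max(sum(counts.values()), 1)  # avoid division by zero
    return [counts[t] / total for t in triads]

vec = conjoint_triad("MKVLAAGIV")  # toy peptide, not real data
```

The same windowed counting applies to RNA with a four-letter alphabet, which is how k-mer features for both molecules can feed a single model.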


2021 ◽  
Author(s):  
Pora Kim ◽  
Hua Tan ◽  
Jiajia Liu ◽  
Mengyuan Yang ◽  
Xiaobo Zhou

Identifying the molecular mechanisms related to genomic breakage is an important goal of cancer mechanism studies. Among the diverse locations of structural variant breakpoints, fusion genes, which have breakpoints in gene bodies and are typically identified from RNA-seq data, provide a prominent structural variant resource for studying genomic breakage together with its expression and potential pathogenic impacts. In this study, we developed FusionAI, which uses deep learning to predict gene fusion breakpoints from primary sequences and lets us identify the fusion breakage code and genomic context. FusionAI leverages known fusion breakpoints to provide a prediction model of fusion genes from primary genomic sequences via deep learning, thereby helping researchers select fusion genes more accurately and better understand genomic breakage.


2021 ◽  
Author(s):  
Lei Deng ◽  
Wenjuan Nie ◽  
Jiaojiao Zhao ◽  
Jingpu Zhang

Abstract Background: Viral infections and diseases, which are a threat to human health, are caused by various viruses and involve protein-protein interactions (PPIs) between virus and host. Studying virus-host PPIs helps in understanding the mechanism of viral infection and in developing new drugs. Although several computational methods for predicting virus-host PPIs have been proposed, most of them rely on machine learning algorithms that make hidden high-level features difficult to extract. Results: We proposed a novel hybrid deep learning framework that combines four CNN layers and an LSTM to predict virus-host PPIs using only protein sequence information. The CNN can extract nonlinear position-related features of the protein sequence, and the LSTM can capture long-term relevant information. L1-regularized logistic regression is applied to eliminate noise and redundant information. Our model achieved the best performance on the benchmark dataset and the independent set compared with other existing methods. Conclusion: Our method, through the hybrid deep neural network, is useful for predicting virus-host PPIs using protein sequence alone, and achieved the best prediction performance compared with other existing methods, which is promising for virus-host PPI prediction.


2019 ◽  
Author(s):  
Hannah O. Ajoge ◽  
Tyler M. Renner ◽  
Kasandra Bélanger ◽  
Hinissan P. Kohio ◽  
Macon D. Coleman ◽  
...  

Abstract: APOBEC3 (A3) proteins are host-encoded deoxycytidine deaminases that provide an innate immune barrier to retroviral infection, notably against HIV-1. While the catalytic activity of these proteins can induce catastrophic hypermutation in proviral DNA leading to near-total restriction of infection, sublethal levels of deamination contribute to the genetic evolution of HIV-1. So far, little is known about how A3 might impact HIV-1 integration into human chromosomal DNA. Using a deep sequencing approach, we analyzed the influence of A3F and A3G on HIV-1 integration site selection. DNA editing was detected at the extremities of the long terminal repeat regions of the virus. Both catalytically active and non-catalytic A3 enzymes decreased insertions into gene coding sequences and increased integration into SINE elements, oncogenes and transcription-silencing non-B DNA features. Our data implicate A3 proteins as host factors that influence HIV-1 integration site selection and promote insertions into genomic sites that are transcriptionally less active.

Graphical abstract: Schematic depicting the influence of APOBEC3 (A3) proteins on HIV integration site targeting. Left, in the absence of A3, HIV has a strong preference for integrating into genes. Right, both catalytically active and non-catalytic A3 mutants decrease integration into genes and increase integration into SINE elements and transcription-silencing non-B DNA features.


2019 ◽  
Author(s):  
Yi Guo ◽  
Xiang Chen

Abstract Motivation: Almost all critical functions and processes in cells are sustained by cellular networks of protein-protein interactions (PPIs); understanding these is therefore crucial in the investigation of biological systems. Despite all past efforts, we still lack high-quality PPI data for constructing the networks, which makes it challenging to study the functions of protein associations. High-throughput experimental techniques have produced abundant data for systematically studying the cellular networks of a biological system and for developing computational methods for PPI identification. Results: We have developed a deep learning-based framework, named iPPI, for accurately predicting PPIs on a proteome-wide scale depending only on sequence information. iPPI integrates the amino acid properties and compositions of protein sequences into a unified prediction framework using a hybrid deep neural network. Extensive tests demonstrated that iPPI can greatly outperform state-of-the-art prediction methods in identifying PPIs. In addition, the iPPI prediction score can be related to the strength of protein-protein binding affinity, which further shows the biological relevance of our deep learning framework for identifying PPIs. Availability and Implementation: iPPI is available as open-source software and can be downloaded from https://github.com/model-lab/[email protected]


2021 ◽  
Vol 22 (7) ◽  
pp. 3589
Author(s):  
Runtao Yang ◽  
Feng Wu ◽  
Chengjin Zhang ◽  
Lina Zhang

As critical components of DNA, enhancers can efficiently and specifically manipulate the spatial and temporal regulation of gene transcription. Malfunction or dysregulation of enhancers is implicated in a range of human pathologies. Therefore, identifying enhancers and their strength may provide insights into the molecular mechanisms of gene transcription and facilitate the discovery of candidate drug targets. In this paper, a new predictor of enhancers and their strength, iEnhancer-GAN, is proposed based on a deep learning framework in combination with word embedding and a sequence generative adversarial net (Seq-GAN). Considering the relatively small training dataset, the Seq-GAN is designed to generate artificial sequences. Given that each functional element in DNA sequences is analogous to a “word” in linguistics, word segmentation methods are proposed to divide DNA sequences into “words”, and the skip-gram model is employed to transform the “words” into numerical vectors. In view of its powerful ability to extract high-level abstract features, a convolutional neural network (CNN) architecture is constructed to perform the identification tasks, and the word vectors of DNA sequences are vertically concatenated to form the embedding matrices used as the input of the CNN. Experimental results demonstrate the effectiveness of the Seq-GAN in expanding the training dataset, the possibility of applying word segmentation methods to extract “words” from DNA sequences, the feasibility of using the skip-gram model to encode DNA sequences, and the powerful prediction ability of the CNN. Compared with other state-of-the-art methods on the training dataset and the independent test dataset, the proposed method achieves a significantly improved overall performance. The proposed method is expected to help advance enhancer-related fields.
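The pipeline of segmenting DNA into “words” and stacking their vectors into an embedding matrix can be sketched as follows. This is an illustration under stated assumptions: fixed-length k-mer windows stand in for the word segmentation methods, and a seeded random lookup table stands in for vectors that a trained skip-gram model would supply. The function names and parameters are hypothetical.

```python
import random

def segment(seq, k=3, stride=1):
    """Split a DNA sequence into k-mer "words".
    stride=1 gives overlapping words; stride=k gives non-overlapping ones."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

def build_embedding_matrix(words, table):
    """Stack the vector of each word row by row (vertical concatenation),
    giving a len(words) x embedding_dim matrix suitable as CNN input."""
    return [table[w] for w in words]

rng = random.Random(0)              # fixed seed so the sketch is repeatable
seq = "ACGTACGTTG"                  # toy sequence, not real data
words = segment(seq, k=3)

# Placeholder for skip-gram vectors: random 4-dimensional vectors (assumption).
vocab = sorted(set(words))
table = {w: [rng.uniform(-1.0, 1.0) for _ in range(4)] for w in vocab}

matrix = build_embedding_matrix(words, table)
```

Repeated words such as `ACG` map to the same row vector wherever they occur, which is what lets the downstream CNN recognize recurring sequence elements.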


2021 ◽  
Vol 17 (8) ◽  
pp. e1009247
Author(s):  
Frances L. Heredia ◽  
Abiel Roche-Lima ◽  
Elsie I. Parés-Matos

The selection of a DNA aptamer through the Systematic Evolution of Ligands by EXponential enrichment (SELEX) method involves multiple binding steps, in which a target and a library of randomized DNA sequences are mixed for selection of a single, nucleotide-specific molecule. Usually, 10 to 20 steps are required for SELEX to be completed. Throughout this process it is necessary to discriminate between true DNA aptamers and unspecified DNA-binding sequences. Thus, a novel machine learning-based approach was developed to support and simplify the early steps of the SELEX process by helping to discriminate DNA aptamers from unspecified DNA-binding sequences. An Artificial Intelligence (AI) approach to identify aptamers was implemented based on Natural Language Processing (NLP) and Machine Learning (ML). An NLP method (CountVectorizer) was used to extract information from the nucleotide sequences. Four ML algorithms (Logistic Regression, Decision Tree, Gaussian Naïve Bayes, Support Vector Machines) were trained using data from the NLP method along with sequence information. The best performing model was Support Vector Machines because it had the best ability to discriminate between positive and negative classes. Our model achieved an Accuracy (A) of 0.995 (the fraction of samples that the model correctly classified) and an Area Under the Receiver Operating Characteristic curve (AUROC) of 0.998 (the degree to which a model is capable of distinguishing between classes). The developed AI approach is useful for identifying potential DNA aptamers and reducing the number of rounds in a SELEX selection. This new approach could be applied in the design of DNA libraries and result in a more efficient and faster process for choosing DNA aptamers during SELEX.
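The two metrics defined in the abstract can be computed by hand on a toy example. The sketch below is a plain-Python illustration, not the authors' evaluation code. The labels, scores and the 0.5 decision threshold are made-up values, and AUROC is computed through its pairwise-ranking interpretation: the probability that a randomly chosen positive sample is scored above a randomly chosen negative one.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def auroc(y_true, scores):
    """AUROC as the share of (positive, negative) pairs in which the
    positive sample has the higher score; ties count as half a win."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy labels (1 = aptamer, 0 = non-aptamer) and classifier scores.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

acc = accuracy(y_true, y_pred)
auc = auroc(y_true, scores)
```

Accuracy depends on the chosen threshold, whereas AUROC summarizes ranking quality across all thresholds, which is why the two values are reported separately.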


2021 ◽  
Author(s):  
Michael Alexander Suarez Vasquez ◽  
Mingyi Xue ◽  
Jordy Homing Lam ◽  
Eshani C Goonetilleke ◽  
Xin Gao ◽  
...  

Fragment-based drug design plays an important role in the drug discovery process by reducing the complex small-molecule space into a more manageable fragment space. We leverage the power of deep learning to design ChemPLAN-Net, a model that incorporates the pairwise association of physicochemical features of both the protein drug targets and the inhibitor, and learns from thousands of protein co-crystal structures in the PDB database to predict previously unseen inhibitor fragments. Our novel protocol handles the computationally challenging multi-label, multi-class problem by defining a fragment database and using an iterative feature-pair binary classification approach. By training ChemPLAN-Net on available co-crystal structures of the protease protein family, excluding HIV-1 protease as a target, we are able to outperform fragment docking and recover the target's inhibitor fragments found in co-crystal structures or identified by in vitro cell assays.

