Accurate deep learning off-target prediction with novel sgRNA-DNA sequence encoding in CRISPR-Cas9 gene editing

Author(s):  
Jeremy Charlier ◽  
Robert Nadon ◽  
Vladimir Makarenkov

Abstract Motivation: Off-target predictions are crucial in gene editing research. Recently, significant progress has been made in the prediction of off-target mutations, particularly with CRISPR-Cas9 data, thanks to the use of deep learning. CRISPR-Cas9 is a gene editing technique which allows manipulation of DNA fragments. The sgRNA-DNA (single guide RNA-DNA) sequence encoding for deep neural networks, however, has a strong impact on the prediction accuracy. We propose a novel encoding of sgRNA-DNA sequences that aggregates sequence data with no loss of information. Results: In our experiments, we compare the proposed sgRNA-DNA sequence encoding applied in a deep learning prediction framework with state-of-the-art encoding and prediction methods. We demonstrate the superior accuracy of our approach in a simulation study involving Feedforward Neural Networks (FNNs), Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) as well as the traditional Random Forest (RF), Naive Bayes (NB) and Logistic Regression (LR) classifiers. We highlight the quality of our results by building several FNNs, CNNs and RNNs with various layer depths and performing predictions on two popular gene editing data sets, CRISPOR and GUIDE-seq. In all our experiments, the new encoding led to more accurate off-target prediction results, improving the area under the Receiver Operating Characteristic (ROC) curve by up to 35%. Availability: The code and data used in this study are available at: https://github.com/dagrate/dl-offtarget
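The abstract does not spell out the encoding itself, so the following is only a hypothetical illustration of what "no loss of information" can mean for a sgRNA-DNA pair. It assumes both sequences are written in the DNA alphabet and one-hot encoded, and it contrasts two ways of combining them: an element-wise OR (which cannot tell which strand carried which base at a mismatch) and simple stacking (which can). The function names `one_hot`, `encode_or` and `encode_stacked` are invented for this sketch and are not taken from the authors' repository.

```python
import numpy as np

BASES = "ACGT"  # row order of the one-hot matrix (an assumption of this sketch)

def one_hot(seq):
    """Return a 4 x len(seq) one-hot matrix with rows ordered A, C, G, T."""
    mat = np.zeros((4, len(seq)), dtype=np.int8)
    for j, base in enumerate(seq.upper()):
        mat[BASES.index(base), j] = 1
    return mat

def encode_or(sgrna, dna):
    """Combine the two one-hot matrices with an element-wise OR (4 x L).

    At a mismatch, two rows are set in the same column, but the matrix
    no longer records which sequence contributed which base.
    """
    return one_hot(sgrna) | one_hot(dna)

def encode_stacked(sgrna, dna):
    """Stack the two one-hot matrices (8 x L): rows 0-3 sgRNA, rows 4-7 DNA.

    Both original sequences can be read back from the result.
    """
    return np.vstack([one_hot(sgrna), one_hot(dna)])

# Two different pairs that differ only in which strand carries C and which G.
pair_a = ("ACGT", "AGGT")
pair_b = ("AGGT", "ACGT")

same_or = np.array_equal(encode_or(*pair_a), encode_or(*pair_b))
same_stacked = np.array_equal(encode_stacked(*pair_a), encode_stacked(*pair_b))
print("OR encodings identical:     ", same_or)
print("stacked encodings identical:", same_stacked)
```

In this toy case the OR encoding maps the two distinct pairs to the same matrix, while the stacked encoding keeps them apart, at the cost of doubling the number of rows.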

2020 ◽  
Vol 15 ◽  
Author(s):  
Zhihua Du ◽  
Xiangdong Xiao ◽  
Vladimir N. Uversky

Chromosomal DNA contains most of the genetic information of eukaryotes and plays an important role in the growth, development and reproduction of living organisms. Most chromosomal DNA sequences are known to wrap around histones, and distinguishing these DNA sequences from ordinary DNA sequences is important for understanding the genetic code of life. The main difficulty behind this problem is the feature selection process. DNA sequences have no explicit features, and common representation methods, such as one-hot coding, have the major drawback of high dimensionality. Recently, deep learning models have been shown to extract useful features from input patterns automatically. In this paper, we present four different deep learning architectures using convolutional neural networks and long short-term memory networks for the purpose of chromosomal DNA sequence classification. A natural language model (Word2vec) was used to generate word embeddings of the sequences, from which features were then learned by deep learning. The four architectures are compared on 10 chromosomal DNA datasets. The results show that the architecture combining convolutional neural networks with long short-term memory networks is superior to the other methods in accuracy of chromosomal DNA prediction.
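To apply a word-embedding model such as Word2vec to DNA, the sequence first has to be split into "words". The abstract does not say how this was done; a common choice, assumed here, is overlapping k-mers. The sketch below shows that tokenization step alone, together with the size of a flattened one-hot vector for comparison. The helper names and the choice k = 3 are illustrative assumptions, and no embedding model is trained here.

```python
def kmer_tokens(seq, k=3):
    """Split a sequence into overlapping k-mers (stride 1)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def one_hot_length(seq_len, alphabet_size=4):
    """Length of a flattened one-hot vector for a sequence of seq_len bases."""
    return seq_len * alphabet_size

tokens = kmer_tokens("ACGTAC", k=3)
print(tokens)               # the "sentence" that an embedding model would see
print(one_hot_length(500))  # flattened one-hot size for a 500-base sequence
```

Each token list can then be treated as a sentence by an embedding model, which maps every k-mer to a dense vector of fixed size regardless of sequence length.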


Genetics ◽  
1993 ◽  
Vol 134 (4) ◽  
pp. 1195-1204
Author(s):  
S Tarès ◽  
J M Cornuet ◽  
P Abad

Abstract An AluI family of highly reiterated nontranscribed sequences has been found in the genome of the honeybee Apis mellifera. This repeated sequence is shown to be present at approximately 23,000 copies per haploid genome, constituting about 2% of the total genomic DNA. The nucleotide sequence of 10 monomers was determined. The consensus sequence is 176 nucleotides long and has an A + T content of 58%. There are clusters of both direct and inverted repeats. Internal subrepeating units ranging from 11 to 17 nucleotides are observed, suggesting that it could have evolved from a shorter sequence. DNA sequence data reveal that this repeat class is unusually homogeneous compared to the other classes of invertebrate highly reiterated DNA sequences. The average pairwise sequence divergence between the repeats is 2.5%. In spite of this unusual homogeneity, divergence has been found in the repeated sequence hybridization ladder between four different honeybee subspecies. Therefore, the AluI highly reiterated sequences provide a new probe for fingerprinting in A. m. mellifera.


Complexity ◽  
2018 ◽  
Vol 2018 ◽  
pp. 1-10 ◽  
Author(s):  
J. M. Torres ◽  
R. M. Aguilar

Making every component of an electrical system work in unison is being made more challenging by the increasing number of renewable energies used, the electrical output of which is difficult to determine beforehand. In Spain, the daily electricity market opens with a 12-hour lead time, where the supply and demand expected for the following 24 hours are presented. When estimating the generation, energy sources like nuclear are highly stable, while peaking power plants can be run as necessary. Renewable energies, however, which should eventually replace peakers insofar as possible, are reliant on meteorological conditions. In this paper we propose using different deep-learning techniques and architectures to solve the problem of predicting wind generation in order to participate in the daily market, by making predictions 12 and 36 hours in advance. We develop and compare various estimators based on feedforward, convolutional, and recurrent neural networks. These estimators were trained and validated with data from a wind farm located on the island of Tenerife. We show that the best candidates for each type are more precise than the reference estimator and the polynomial regression currently used at the wind farm. We also conduct a sensitivity analysis to determine which estimator type is most robust to perturbations. An analysis of our findings shows that the most accurate and robust estimators are those based on feedforward neural networks with a SELU activation function and convolutional neural networks.


2021 ◽  
Author(s):  
Florian Störtz ◽  
Jeffrey Mak ◽  
Peter Minary

CRISPR/Cas programmable nuclease systems have become ubiquitous in the field of gene editing. With progressing development, applications in in vivo therapeutic gene editing are increasingly within reach, yet limited by possible adverse side effects from unwanted edits. Recent years have thus seen continuous development of off-target prediction algorithms trained on in vitro cleavage assay data gained from immortalised cell lines. Here, we implement novel deep learning algorithms and feature encodings for off-target prediction and systematically sample the resulting model space in order to find optimal models and inform future modelling efforts. We lay emphasis on physically informed features, hence terming our approach piCRISPR, which we gain on the large, diverse crisprSQL off-target cleavage dataset. We find that our best-performing model highlights the importance of sequence context and chromatin accessibility for cleavage prediction and outperforms state-of-the-art prediction algorithms in terms of area under precision-recall curve.


2019 ◽  
Vol 37 (5) ◽  
pp. 1495-1507 ◽  
Author(s):  
Zhengting Zou ◽  
Hongjiu Zhang ◽  
Yuanfang Guan ◽  
Jianzhi Zhang

Abstract Phylogenetic inference is of fundamental importance to evolutionary as well as other fields of biology, and molecular sequences have emerged as the primary data for this task. Although many phylogenetic methods have been developed to explicitly take into account substitution models of sequence evolution, such methods could fail due to model misspecification or insufficiency, especially in the face of heterogeneities in substitution processes across sites and among lineages. In this study, we propose to infer topologies of four-taxon trees using deep residual neural networks, a machine learning approach needing no explicit modeling of the subject system and having a record of success in solving complex nonlinear inference problems. We train residual networks on simulated protein sequence data with extensive amino acid substitution heterogeneities. We show that the well-trained residual network predictors can outperform existing state-of-the-art inference methods such as the maximum likelihood method on diverse simulated test data, especially under extensive substitution heterogeneities. Reassuringly, residual network predictors generally agree with existing methods in the trees inferred from real phylogenetic data with known or widely believed topologies. Furthermore, when combined with the quartet puzzling algorithm, residual network predictors can be used to reconstruct trees with more than four taxa. We conclude that deep learning represents a powerful new approach to phylogenetic reconstruction, especially when sequences evolve via heterogeneous substitution processes. We present our best trained predictor in a freely available program named Phylogenetics by Deep Learning (PhyDL, https://gitlab.com/ztzou/phydl; last accessed January 3, 2020).


Zootaxa ◽  
2012 ◽  
Vol 3361 (1) ◽  
pp. 56-62 ◽  
Author(s):  
JOSEFINA CURIEL ◽  
JUAN J. MORRONE

Insect life stages are known imperfectly in many cases, and classifications are usually based on adult morphology. This is unfortunate as information on other life stages may be useful for biomonitoring. The major impediment to using elmid (Coleoptera) larvae for freshwater biomonitoring is the lack of larval descriptions and illustrations. Reliable molecular protocols may be used to associate larvae and adults. After adults of seven species of Mexican Macrelmis were identified morphologically, seven larval specimens were associated to them based on two gene fragments: Cox1 and Cob. The phylogenetic analysis allowed identifying the larval specimens as Macrelmis leonilae, M. scutellaris, M. species 7, M. species 10, and M. species 11. Two species based on adults associated uncertainly with one larva, and one larva did not match with any adult. Adult/larval association in elmids using DNA sequence data seems to be promising in terms of speed and reliability.


2021 ◽  
Vol 17 (9) ◽  
pp. e1009345
Author(s):  
Zhengqiao Zhao ◽  
Stephen Woloszynek ◽  
Felix Agbavor ◽  
Joshua Chang Mell ◽  
Bahrad A. Sokhansanj ◽  
...  

Recurrent neural networks with memory and attention mechanisms are widely used in natural language processing because they can capture short and long term sequential information for diverse tasks. We propose an integrated deep learning model for microbial DNA sequence data, which exploits convolutional neural networks, recurrent neural networks, and attention mechanisms to predict taxonomic classifications and sample-associated attributes, such as the relationship between the microbiome and host phenotype, on the read/sequence level. In this paper, we develop this novel deep learning approach and evaluate its application to amplicon sequences. We apply our approach to short DNA reads and full sequences of 16S ribosomal RNA (rRNA) marker genes, which identify the heterogeneity of a microbial community sample. We demonstrate that our implementation of a novel attention-based deep network architecture, Read2Pheno, achieves read-level phenotypic prediction. Training Read2Pheno models will encode sequences (reads) into dense, meaningful representations: learned embedded vectors output from the intermediate layer of the network model, which can provide biological insight when visualized. The attention layer of Read2Pheno models can also automatically identify nucleotide regions in reads/sequences which are particularly informative for classification. As such, this novel approach can avoid pre/post-processing and manual interpretation required with conventional approaches to microbiome sequence classification. We further show, as proof-of-concept, that aggregating read-level information can robustly predict microbial community properties, host phenotype, and taxonomic classification, with performance at least comparable to conventional approaches. An implementation of the attention-based deep learning network is available at https://github.com/EESI/sequence_attention (a python package) and https://github.com/EESI/seq2att (a command line tool).


2021 ◽  
Vol 12 ◽  
Author(s):  
Ying Zhang ◽  
Yupei Zhou ◽  
Wei Sun ◽  
Lili Zhao ◽  
D. Pavlic-Zupanc ◽  
...  

The genus Botryosphaeria includes more than 200 epithets, but only the type species, Botryosphaeria dothidea, and a dozen or more other species have been identified based on DNA sequence data. The taxonomic status of the other species remains unconfirmed because they lack either morphological information or DNA sequence data. In this study, types or authentic specimens of 16 “Botryosphaeria” species are reassessed to clarify their identity and phylogenetic position. nuDNA sequences of four regions, ITS, LSU, tef1-α and tub2, are analyzed and considered in combination with morphological characteristics. Based on the multigene phylogeny and morphological characters, Botryosphaeria cruenta and Botryosphaeria hamamelidis are transferred to Neofusicoccum. The generic status of Botryosphaeria aterrima and Botryosphaeria mirabile is confirmed in Botryosphaeria. Botryosphaeria berengeriana var. weigeliae and B. berengeriana var. acerina are treated as synonyms of B. dothidea. Botryosphaeria mucosa is transferred to Neodeightonia as Neodeightonia mucosa, and Botryosphaeria ferruginea to Nothophoma as Nothophoma ferruginea. Botryosphaeria foliicola is reduced to synonymy with Phyllachorella micheliae. Botryosphaeria abuensis, Botryosphaeria aesculi, Botryosphaeria dasylirii, and Botryosphaeria wisteriae are tentatively kept in Botryosphaeria sensu stricto until further phylogenetic analysis is carried out on verified specimens. The ordinal status of Botryosphaeria apocyni, Botryosphaeria gaubae, and Botryosphaeria smilacinina cannot be determined, and these species are tentatively accommodated in Dothideomycetes incertae sedis. The study demonstrates the significance of a polyphasic approach in characterizing type specimens, including the importance of using DNA sequence data.


2020 ◽  
Author(s):  
Alisson Hayasi da Costa ◽  
Renato Augusto C. dos Santos ◽  
Ricardo Cerri

Abstract PIWI-Interacting RNAs (piRNAs) form an important class of non-coding RNAs that play a key role in genome integrity through the silencing of transposable elements. However, despite their importance and the wide application of deep learning to classification tasks in computational biology, there are few studies of deep learning and neural networks for piRNA prediction. Therefore, this paper presents an investigation of deep feedforward network models for classification of transposon-derived piRNAs. We analyze and compare the results of the neural networks under different hyperparameter choices, such as number of layers, activation functions and optimizers, clarifying the advantages and disadvantages of each configuration. From this analysis, we propose a model for human piRNA classification and compare our method with the state-of-the-art deep neural network for piRNA prediction in the literature and with traditional machine learning algorithms, such as Support Vector Machines and Random Forests, showing that our model achieves an F-measure of 0.872, outperforming the state-of-the-art method in the literature.
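For readers unfamiliar with the metric, the F-measure reported above combines precision and recall into a single score. The sketch below computes it from confusion-matrix counts using the standard F-beta definition; the counts in the example are made up for illustration and are not results from the paper, and the abstract does not state which beta was used (beta = 1 is assumed here).

```python
def f_measure(tp, fp, fn, beta=1.0):
    """F-beta score from true positives, false positives and false negatives.

    With beta = 1 this is the harmonic mean of precision and recall.
    Returns 0.0 when the score is undefined (no positives predicted or present).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Illustrative counts only: 8 correct piRNA calls, 2 false alarms, 2 misses.
score = f_measure(tp=8, fp=2, fn=2)
print(round(score, 3))
```

Because the harmonic mean is dominated by the smaller of the two values, a classifier cannot reach a high F-measure by maximizing recall alone at the expense of precision, or the reverse.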


2021 ◽  
Author(s):  
Tony Zeng ◽  
Yang I Li

Recent progress in deep learning approaches has greatly improved the prediction of RNA splicing from DNA sequence. Here, we present Pangolin, a deep learning model to predict splice site strength in multiple tissues that has been trained on RNA splicing and sequence data from four species. Pangolin outperforms state-of-the-art methods for predicting RNA splicing on a variety of prediction tasks. We use Pangolin to study the impact of genetic variants on RNA splicing, including lineage-specific variants and rare variants of uncertain significance. Pangolin predicts loss-of-function mutations with high accuracy and recall, particularly for mutations that are not missense or nonsense (AUPRC = 0.93), demonstrating remarkable potential for identifying pathogenic variants.

