DeePaC: predicting pathogenic potential of novel DNA with reverse-complement neural networks

Author(s):  
Jakub M Bartoszewicz ◽  
Anja Seidel ◽  
Robert Rentzsch ◽  
Bernhard Y Renard

Abstract Motivation: We expect novel pathogens to arise due to their fast-paced evolution, and new species to be discovered thanks to advances in DNA sequencing and metagenomics. Moreover, recent developments in synthetic biology raise concerns that some strains of bacteria could be modified for malicious purposes. Traditional approaches to open-view pathogen detection depend on databases of known organisms, which limits their performance on unknown, unrecognized and unmapped sequences. In contrast, machine learning methods can infer pathogenic phenotypes from single NGS reads, even though the biological context is unavailable. Results: We present DeePaC, a Deep Learning Approach to Pathogenicity Classification. It includes a flexible framework allowing easy evaluation of neural architectures with reverse-complement parameter sharing. We show that convolutional neural networks and LSTMs outperform the state-of-the-art based on both sequence homology and machine learning. Combining a deep learning approach with integrating the predictions for both mates in a read pair results in cutting the error rate almost in half in comparison to the previous state-of-the-art. Availability and implementation: The code and the models are available at: https://gitlab.com/rki_bioinformatics/DeePaC. Supplementary information: Supplementary data are available at Bioinformatics online.

2019 ◽  
Author(s):  
Jakub M. Bartoszewicz ◽  
Anja Seidel ◽  
Robert Rentzsch ◽  
Bernhard Y. Renard

Abstract Motivation: We expect novel pathogens to arise due to their fast-paced evolution, and new species to be discovered thanks to advances in DNA sequencing and metagenomics. What is more, recent developments in synthetic biology raise concerns that some strains of bacteria could be modified for malicious purposes. Traditional approaches to open-view pathogen detection depend on databases of known organisms, limiting their performance on unknown, unrecognized, and unmapped sequences. In contrast, machine learning methods can infer pathogenic phenotypes from single NGS reads even though the biological context is unavailable. However, modern neural architectures treat DNA as a simple character string and may predict conflicting labels for a given sequence and its reverse-complement. This undesirable property may impact model performance. Results: We present DeePaC, a Deep Learning Approach to Pathogenicity Classification. It includes a universal, extensible framework for neural architectures ensuring identical predictions for any given DNA sequence and its reverse-complement. We implement reverse-complement convolutional neural networks and LSTMs, which outperform the state-of-the-art methods based on both sequence homology and machine learning. Combining a reverse-complement architecture with integrating the predictions for both mates in a read pair results in cutting the error rate almost in half in comparison to the previous state-of-the-art. Availability: The code and the models are available at: https://gitlab.com/rki_bioinformatics/DeePaC
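The reverse-complement property described in this abstract can be illustrated with a minimal sketch. It is not the DeePaC implementation: the abstracts describe architectures that guarantee identical predictions by design (through reverse-complement parameter sharing), whereas the sketch below swaps in a simpler technique, averaging a model's scores over both strands after the fact. The names `toy_score`, `rc_invariant` and `pair_score` are hypothetical, and averaging the two mates of a read pair is only one assumed way of "integrating" their predictions.

```python
from typing import Callable

# Translation table for DNA complements; N (unknown base) maps to itself.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse-complement of a DNA string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def rc_invariant(score: Callable[[str], float]) -> Callable[[str], float]:
    """Wrap a scoring function so a read and its reverse-complement score the same.

    Assumption: post-hoc averaging over both strands, not the parameter
    sharing used inside a reverse-complement network.
    """
    def wrapped(seq: str) -> float:
        return 0.5 * (score(seq) + score(reverse_complement(seq)))
    return wrapped


def toy_score(seq: str) -> float:
    """Hypothetical strand-sensitive stand-in for a trained model: fraction of G."""
    return seq.upper().count("G") / len(seq) if seq else 0.0


def pair_score(score: Callable[[str], float], mate1: str, mate2: str) -> float:
    """Combine the predictions for both mates of a read pair by averaging (assumed)."""
    return 0.5 * (score(mate1) + score(mate2))


if __name__ == "__main__":
    read = "GGGATTC"
    invariant = rc_invariant(toy_score)
    # The raw toy model disagrees between strands; the wrapped one does not.
    print(toy_score(read), toy_score(reverse_complement(read)))
    print(invariant(read), invariant(reverse_complement(read)))
    print(pair_score(invariant, read, "ACGTTGCA"))
```

Averaging doubles the inference cost per read, which is one reason a design with built-in parameter sharing may be preferable in practice.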


2018 ◽  
Vol 15 (1) ◽  
pp. 6-28 ◽  
Author(s):  
Javier Pérez-Sianes ◽  
Horacio Pérez-Sánchez ◽  
Fernando Díaz

Background: Automated compound testing is currently the de facto standard method for drug screening, but it has not brought the great increase in the number of new drugs that was expected. Computer-aided compound search, known as Virtual Screening, has shown its benefits to this field as a complement or even an alternative to robotic drug discovery. There are different methods and approaches to address this problem, and most of them fall under one of the main screening strategies. Machine learning, however, has established itself as a virtual screening methodology in its own right and it may grow in popularity with the new trends in artificial intelligence. Objective: This paper will attempt to provide a comprehensive and structured review that collects the most important proposals made so far in this area of research. Particular attention is given to some recent developments carried out in the machine learning field: the deep learning approach, which is pointed out as a future key player in the virtual screening landscape.


Algorithms ◽  
2021 ◽  
Vol 14 (2) ◽  
pp. 39
Author(s):  
Carlos Lassance ◽  
Vincent Gripon ◽  
Antonio Ortega

Deep Learning (DL) has attracted a lot of attention for its ability to reach state-of-the-art performance in many machine learning tasks. The core principle of DL methods consists of training composite architectures in an end-to-end fashion, where inputs are associated with outputs trained to optimize an objective function. Because of their compositional nature, DL architectures naturally exhibit several intermediate representations of the inputs, which belong to so-called latent spaces. When treated individually, these intermediate representations are most of the time unconstrained during the learning process, as it is unclear which properties should be favored. However, when processing a batch of inputs concurrently, the corresponding set of intermediate representations exhibit relations (what we call a geometry) on which desired properties can be sought. In this work, we show that it is possible to introduce constraints on these latent geometries to address various problems. In more detail, we propose to represent geometries by constructing similarity graphs from the intermediate representations obtained when processing a batch of inputs. By constraining these Latent Geometry Graphs (LGGs), we address the three following problems: (i) reproducing the behavior of a teacher architecture is achieved by mimicking its geometry, (ii) designing efficient embeddings for classification is achieved by targeting specific geometries, and (iii) robustness to deviations on inputs is achieved via enforcing smooth variation of geometry between consecutive latent spaces. Using standard vision benchmarks, we demonstrate the ability of the proposed geometry-based methods in solving the considered problems.
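As a rough illustration of the idea in this abstract, a similarity graph can be built from the intermediate representations of one batch and then compared with the graph of another network. The sketch below is written under stated assumptions and is not the authors' method: cosine similarity as the edge weight, a zeroed diagonal, and a summed squared difference as the distance between two graphs are all choices made here for illustration, and the function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def similarity_graph(batch: List[Vector]) -> Matrix:
    """Weighted adjacency matrix over a batch of latent representations.

    Entry (i, j) is the cosine similarity between samples i and j;
    self-loops are dropped (assumed convention).
    """
    n = len(batch)
    return [
        [cosine(batch[i], batch[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]


def geometry_gap(a: Matrix, b: Matrix) -> float:
    """Summed squared difference between two graphs built on the same batch."""
    return sum((x - y) ** 2 for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


if __name__ == "__main__":
    # Hypothetical latent vectors for the same three inputs in two networks.
    teacher = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
    student = [[1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
    gap = geometry_gap(similarity_graph(teacher), similarity_graph(student))
    print(gap)  # a training loop could minimize this so the student mimics the teacher
```

Because the graph depends only on pairwise relations within the batch, two latent spaces of different dimension can be compared directly, which is what makes a geometry-based constraint convenient for teacher-student setups.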


2016 ◽  
Vol 21 (9) ◽  
pp. 998-1003 ◽  
Author(s):  
Oliver Dürr ◽  
Beate Sick

Deep learning methods are currently outperforming traditional state-of-the-art computer vision algorithms in diverse applications and recently even surpassed human performance in object recognition. Here we demonstrate the potential of deep learning methods to high-content screening–based phenotype classification. We trained a deep learning classifier in the form of convolutional neural networks with approximately 40,000 publicly available single-cell images from samples treated with compounds from four classes known to lead to different phenotypes. The input data consisted of multichannel images. The construction of appropriate feature definitions was part of the training and carried out by the convolutional network, without the need for expert knowledge or handcrafted features. We compare our results against the recent state-of-the-art pipeline in which predefined features are extracted from each cell using specialized software and then fed into various machine learning algorithms (support vector machine, Fisher linear discriminant, random forest) for classification. The performance of all classification approaches is evaluated on an untouched test image set with known phenotype classes. Compared to the best reference machine learning algorithm, the misclassification rate is reduced from 8.9% to 6.6%.


Author(s):  
Dan Stowell

Terrestrial bioacoustics, like many other domains, has recently witnessed some transformative results from the application of deep learning and big data (Stowell 2017, Mac Aodha et al. 2018, Fairbrass et al. 2018, Mercado III and Sturdy 2017). Generalising over specific projects, which bioacoustic tasks can we consider "solved"? What can we expect in the near future, and what remains hard to do? What does a bioacoustician need to understand about deep learning? This contribution will address these questions, giving the audience a concise summary of recent developments and ways forward. It builds on recent projects and evaluation campaigns led by the author (Stowell et al. 2015, Stowell et al. 2018), as well as broader developments in signal processing, machine learning and bioacoustic applications of these. We will discuss which type of deep learning networks are appropriate for audio data, how to address zoological/ecological applications which often have few available data, and issues in integrating deep learning predictions with existing workflows in statistical ecology.


2018 ◽  
Author(s):  
Gary H. Chang ◽  
David T. Felson ◽  
Shangran Qiu ◽  
Terence D. Capellini ◽  
Vijaya B. Kolachalama

Abstract Background and objective: It remains difficult to characterize pain in knee joints with osteoarthritis solely by radiographic findings. We sought to understand how advanced machine learning methods such as deep neural networks can be used to analyze raw MRI scans and predict bilateral knee pain, independent of other risk factors. Methods: We developed a deep learning framework to associate information from MRI slices taken from the left and right knees of subjects from the Osteoarthritis Initiative with bilateral knee pain. Model training was performed by first extracting features from two-dimensional (2D) sagittal intermediate-weighted turbo spin echo slices. The extracted features from all the 2D slices were subsequently combined to directly associate using a fused deep neural network with the output of interest as a binary classification problem. Results: The deep learning model resulted in predicting bilateral knee pain on test data with 70.1% mean accuracy, 51.3% mean sensitivity, and 81.6% mean specificity. Systematic analysis of the predictions on the test data revealed that the model performance was consistent across subjects of different Kellgren-Lawrence grades. Conclusion: The study demonstrates a proof of principle that a machine learning approach can be applied to associate MR images with bilateral knee pain. Significance and innovation: Knee pain is typically considered as an early indicator of osteoarthritis (OA) risk. Emerging evidence suggests that MRI changes are linked to pre-clinical OA, thus underscoring the need for building image-based models to predict knee pain. We leveraged a state-of-the-art machine learning approach to associate raw MR images with bilateral knee pain, independent of other risk factors.


2021 ◽  
Author(s):  
Jakub M. Bartoszewicz ◽  
Ulrich Genske ◽  
Bernhard Y. Renard

Abstract Motivation: Novel pathogens evolve quickly and may emerge rapidly, causing dangerous outbreaks or even global pandemics. Next-generation sequencing is the state of the art in open-view pathogen detection, and one of the few methods available at the earliest stages of an epidemic, even when the biological threat is unknown. Analyzing the samples as the sequencer is running can greatly reduce the turnaround time, but existing tools rely on close matches to lists of known pathogens and perform poorly on novel species. Machine learning approaches can predict if single reads originate from more distant, unknown pathogens, but require relatively long input sequences and processed data from a finished sequencing run. Results: We present DeePaC-Live, a Python package for real-time pathogenic potential prediction directly from incomplete sequencing reads. We train deep neural networks to classify Illumina and Nanopore reads and integrate our models with HiLive2, a real-time Illumina mapper. DeePaC-Live outperforms alternatives based on machine learning and sequence alignment on simulated and real data, including SARS-CoV-2 sequencing runs. After just 50 Illumina cycles, we increase the true positive rate 80-fold compared to the live-mapping approach. The first 250bp of Nanopore reads, corresponding to 0.5s of sequencing time, are enough to yield predictions more accurate than mapping the finished long reads. Our approach could also be used for screening synthetic sequences against biosecurity threats. Availability: The code is available at: https://gitlab.com/dacs-hpi/deepac-live and https://gitlab.com/dacs-hpi/deepac. The package can be installed with Bioconda, Docker or [email protected], [email protected] Supplementary information: Supplementary data are available online.


Author(s):  
Markus N. Rabe ◽  
Christian Szegedy

Abstract Over the recent years, deep learning has found successful applications in mathematical reasoning. Today, we can predict fine-grained proof steps, relevant premises, and even useful conjectures using neural networks. This extended abstract summarizes recent developments of machine learning in mathematical reasoning and the vision of the N2Formal group at Google Research to create an automatic mathematician. The second part discusses the key challenges on the road ahead.


2020 ◽  
Vol 79 (41-42) ◽  
pp. 30387-30395
Author(s):  
Stavros Ntalampiras

Abstract Predicting the emotional responses of humans to soundscapes is a relatively recent field of research coming with a wide range of promising applications. This work presents the design of two convolutional neural networks, namely ArNet and ValNet, each one responsible for quantifying arousal and valence evoked by soundscapes. We build on the knowledge acquired from the application of traditional machine learning techniques on the specific domain, and design a suitable deep learning framework. Moreover, we propose the usage of artificially created mixed soundscapes, the distributions of which are located between the ones of the available samples, a process that increases the variance of the dataset leading to significantly better performance. The reported results outperform the state of the art on a soundscape dataset following Schafer’s standardized categorization considering both sound’s identity and the respective listening context.


2020 ◽  
Author(s):  
Dean Sumner ◽  
Jiazhen He ◽  
Amol Thakkar ◽  
Ola Engkvist ◽  
Esben Jannik Bjerrum

SMILES randomization, a form of data augmentation, has previously been shown to increase the performance of deep learning models compared to non-augmented baselines. Here, we propose a novel data augmentation method we call “Levenshtein augmentation” which considers local SMILES sub-sequence similarity between reactants and their respective products when creating training pairs. The performance of Levenshtein augmentation was tested using two state-of-the-art models: transformer and sequence-to-sequence based recurrent neural networks with attention. Levenshtein augmentation demonstrated increased performance over non-augmented and conventionally SMILES-randomization-augmented data when used for training of baseline models. Furthermore, Levenshtein augmentation seemingly results in what we define as attentional gain, an enhancement in the pattern recognition capabilities of the underlying network to molecular motifs.
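The similarity measure behind the method's name can be sketched as follows. This is a simplified illustration rather than the authors' procedure: it computes the standard Levenshtein edit distance at the character level (so a two-letter atom such as Cl counts as two symbols) and then picks, from a list of equivalent reactant SMILES supplied by the caller, the one closest to the product string. The function `closest_variant` and the selection rule are hypothetical; the abstract only states that local sub-sequence similarity between reactants and products is considered when creating training pairs.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Standard edit distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                  # delete ch_a
                curr[j - 1] + 1,              # insert ch_b
                prev[j - 1] + (ch_a != ch_b)  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def closest_variant(variants: Sequence[str], target: str) -> str:
    """Pick the variant with the smallest edit distance to the target (assumed rule)."""
    return min(variants, key=lambda s: levenshtein(s, target))


if __name__ == "__main__":
    # "CCO" and "OCC" are two SMILES strings for ethanol; "CC=O" is acetaldehyde.
    reactant_variants = ["OCC", "CCO"]
    product = "CC=O"
    for smiles in reactant_variants:
        print(smiles, levenshtein(smiles, product))
    print(closest_variant(reactant_variants, product))
```

In a real pipeline the equivalent SMILES strings would come from a cheminformatics toolkit, and a tokenizer aware of multi-character atoms and ring closures would likely give a more meaningful distance than raw characters.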

